The goal of a data engineer is to open up an organization's data ecosystem to a wide group of analysts, data scientists, and any other interested members. Sure, we could share the SQLite file with every interested user, but what if 10,000 people were interested in the data? SQLite does not scale well for that use case, so we need a better option.

To serve such a large number of users, we are better off using a different database engine. In this course we'll use an open source relational database management system (RDBMS) called **Postgres**. Postgres is a much more robust engine that is implemented as a server rather than a single file. As a server, Postgres accepts connections from clients, who can run **SELECT**, **INSERT**, or any other type of SQL query, making the data accessible to a wide range of people.

Using this model, Postgres can handle multiple simultaneous connections to the database, which solves one of the main data engineering challenges.

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/postgres1.jpg?raw=true">
    
The diagram above illustrates the client-server model used by Postgres. Two users, Rose and Bruno, can both be connected to the same Postgres server and access the databases it contains.

Let's explore the skills needed to interact with and manage a Postgres database. These are fundamental skills for any data engineer, since one of a data engineer's main roles is to ensure that data is readily available and stored in a way that makes accessing it easy and efficient. We will start by learning how to connect to a Postgres database and run simple SQL queries.
    


# Introduction
We will create a table for storing data representing user accounts. The dataset that we will be using is stored in a CSV file named user_accounts.csv. Its data does not correspond to real users; it was randomly generated using faker.

To communicate with our Postgres server, we will use the open source psycopg2 Python library. You can think of using **psycopg2** as similar to connecting to a SQLite database with the sqlite3 library.

To connect to the database, we call the *psycopg2.connect()* function, passing it a string containing the name of the database we want to connect to as well as our username. So, for a user named Rose to connect to a database named dq using psycopg2, we would do the following:

import psycopg2

conn = psycopg2.connect("dbname=dq user=Rose")

As you can see, Rose connects to the database by specifying the database name (dbname) and a user (user) in the psycopg2.connect() call. The string "dbname=dq user=Rose" is referred to as a connection string. In the example above, it specifies that Rose wants to connect to a database named dq using her username, Rose.
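To make the shape of a connection string concrete, here is a minimal sketch that assembles one from keyword/value pairs. The helper names `quote_value` and `build_dsn` are hypothetical (they are not part of psycopg2), and the quoting follows the libpq convention that values which are empty or contain spaces are wrapped in single quotes, with single quotes and backslashes escaped by a backslash.

```python
def quote_value(value):
    """Quote one connection-string value (hypothetical helper, not a psycopg2 API)."""
    text = str(value)
    # Empty values, or values with spaces, quotes, or backslashes, get single-quoted.
    if text == "" or any(ch in text for ch in " '\\"):
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return text


def build_dsn(**params):
    """Join keyword/value pairs into a string such as "dbname=dq user=Rose"."""
    return " ".join(f"{key}={quote_value(value)}" for key, value in params.items())


dsn = build_dsn(dbname="dq", user="Rose")
print(dsn)  # dbname=dq user=Rose
```

In practice a helper like this is rarely needed, because psycopg2.connect() also accepts the same settings as keyword arguments, for example psycopg2.connect(dbname="dq", user="Rose").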

Because Postgres supports multiple simultaneous connections, it uses multiple users and databases as a way to improve security and keep data separated. Without those values, Postgres cannot tell who is trying to connect or to which database; the client library falls back to defaults (typically the operating system username for both), and the connection will usually fail unless those defaults happen to match. Once Rose is connected, she is ready to take advantage of the features Postgres has.
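A failed connection surfaces as an exception, so it can be handled explicitly. The sketch below shows one way to do that. Since a running Postgres server cannot be assumed here, it uses sqlite3 as a stand-in; both libraries follow the Python DB-API and expose an OperationalError class. The helper `try_connect` is a hypothetical name introduced for illustration.

```python
import os
import sqlite3
import tempfile


def try_connect(connect, target, error_type):
    """Return (connection, None) on success or (None, message) on failure."""
    try:
        return connect(target), None
    except error_type as exc:
        return None, str(exc)


# Stand-in target: a file inside a directory that does not exist, so opening it fails.
missing = os.path.join(tempfile.mkdtemp(), "no_such_dir", "db.sqlite")
conn, problem = try_connect(sqlite3.connect, missing, sqlite3.OperationalError)
if conn is None:
    print("Could not connect:", problem)
else:
    conn.close()
```

With a Postgres server available, the equivalent call would be try_connect(psycopg2.connect, "dbname=dq user=Rose", psycopg2.OperationalError).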

Once she has finished working with the database, Rose should close the connection to avoid leaving useless, resource-consuming connections open. To do so, she can call the connection's close() method:

conn.close()
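It is easy to forget to call close(), especially when a query raises an error partway through. One possible pattern, sketched below, wraps the connection in contextlib.closing so that it is closed even on failure. The function `fetch_all` is a hypothetical helper, and sqlite3 with an in-memory database is used as a stand-in so the example runs without a Postgres server.

```python
import sqlite3
from contextlib import closing


def fetch_all(connect, target, query):
    """Open a connection, run one query, and always close the connection."""
    with closing(connect(target)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            return cur.fetchall()


# Stand-in run against an in-memory SQLite database.
rows = fetch_all(sqlite3.connect, ":memory:", "SELECT 1")
print(rows)  # [(1,)]
```

Against Postgres, the call would look like fetch_all(psycopg2.connect, "dbname=dq user=Rose", "SELECT 1"). Note that in psycopg2, using the connection itself in a with statement manages the transaction and does not close the connection, which is why closing() is used here.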

pip install psycopg2

Collecting psycopg2
  Downloading psycopg2-2.8.6-cp37-cp37m-win_amd64.whl (1.1 MB)
Installing collected packages: psycopg2
Successfully installed psycopg2-2.8.6
Note: you may need to restart the kernel to use updated packages.
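After installing (and restarting the kernel if needed), a quick check can confirm that the library is importable. This is a small sketch; `is_installed` is a hypothetical helper built on the standard library's importlib.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("psycopg2"):
    import psycopg2
    print("psycopg2 version:", psycopg2.__version__)
else:
    print("psycopg2 is not installed in this environment")
```

If building psycopg2 from source fails on your platform, the pre-built psycopg2-binary package on PyPI may be an alternative worth trying.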
