# Phase 1 - Extraction

Let's remember our data sources' locations:

* `films.db` - project folder
* `customer_list.xlsx` - https://drive.google.com/drive/folders/1DefyOI3Qn7nthICx5t55DIBrc_PblsIQ?usp=sharing
* `stores.html` and `staff.html` - http://clevernet.pt/python_course/stores.html and http://clevernet.pt/python_course/staff.html
* `payments.json` - project folder
* `inventory-store-1.csv` and `inventory-store-2.csv` (zipped) - https://drive.google.com/drive/folders/1DefyOI3Qn7nthICx5t55DIBrc_PblsIQ?usp=sharing
* `rentals` files - FTP account with the following credentials: 
    * server: ftp.oom.pt
    * username: pyproj@oom.pt
    * password: J8rk#{4)L$~6YNnOZ%
    
So we'll have to find a way to deal with the following:

* connect to a database and work with data using SQL (in our case, SQLite)
* connect to Google Drive and retrieve files
* open and work with Excel files
* scrape a website and extract information from it
* open and work with JSON files
* deal with zipped files
* open and work with CSV files
* connect to an FTP account and retrieve files
* open and work with TXT files

For now we'll just focus on gathering the data, not working with it.

Let's start coding!

## Extracting from sources

### SQL databases

There are multiple ways of interacting with a database in Python: either through a specific connector package (like Python's built-in *sqlite3* package) or through a package like *pandas*, which allows you to query the database and get data from it.

However, we'll go a different route and use a package called *sqlalchemy*. It is a SQL toolkit that includes an ORM (Object Relational Mapper), and it provides a single syntax that works across many different databases, such as SQLite, MySQL, MS SQL Server or Oracle DB.

In [None]:
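# create_engine() only configures the connection; with SQLite, the database file
# is created on the first connection if it doesn't exist and opened if it does
from sqlalchemy import create_engine

engine = create_engine('sqlite:///data_sources/films.db')

# open a connection to check that the database is reachable, then close it again
with engine.connect() as conn:
    print(engine)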

In [None]:
from sqlalchemy import MetaData

# the MetaData object holds all the information about the database and the tables it contains. 
# we use an instance of it to create or drop tables in the database.
metadata = MetaData()

# we'll populate the instance with our db
metadata.reflect(engine)

In [None]:
# we can use the tables property to list our tables
for table in metadata.tables:
    print(table)

In [None]:
# we can query the table details as well:
for table in metadata.tables:
    print(f'Information for table: {table}')
    table_obj = metadata.tables[table]
    print('-' * 40)
    for col in table_obj.columns:
        print(f"{col.name} - {col.type}")
    print()

In [None]:
from sqlalchemy.sql import text

# we can select data using connect():
with engine.connect() as conn:
    query = conn.execute(text('SELECT count(*) as row_count from film'))
    result = query.fetchone()
    print(result[0])

In [None]:
# or using a session:
from sqlalchemy.orm import sessionmaker

# initial configuration arguments
Session = sessionmaker(bind=engine)

# this session is bound to provided engine
session = Session()

query = session.query(metadata.tables['film'])
print(query.count())

In [None]:
# now we can update the output above with more information:
for table in metadata.tables:
    table_obj = metadata.tables[table]
    query = session.query(table_obj)
    print(f'Information for table: {table} - {query.count()} rows')
    print('-' * 40)
    for col in table_obj.columns:
        print(f"{col.name} - {col.type}")
    print()

In [None]:
# here's how we can actually select data
# to select 10 films from the film table that were released in 2006, we could do this:
table_obj = metadata.tables['film']
query = session.query(table_obj).filter(table_obj.c.release_year == '2006').limit(10)
for row in query:
    print(row)
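
For comparison, the connector route mentioned at the start of this section could look roughly like the sketch below. It uses Python's built-in *sqlite3* package against an in-memory database, so the `film` table, its columns and its three rows are made up for illustration and are not the contents of `films.db`.

```python
import sqlite3

# An in-memory database stands in for films.db so the sketch runs anywhere.
# Table layout and rows are assumptions made for this example only.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, release_year TEXT)')
conn.executemany(
    'INSERT INTO film (title, release_year) VALUES (?, ?)',
    [('Example Film A', '2006'), ('Example Film B', '2006'), ('Example Film C', '2005')],
)

# Parameterised query: the ? placeholder keeps the value out of the SQL string.
rows_2006 = conn.execute(
    'SELECT title, release_year FROM film WHERE release_year = ? ORDER BY title LIMIT 10',
    ('2006',),
).fetchall()
row_count = conn.execute('SELECT count(*) FROM film').fetchone()[0]
conn.close()

print(row_count)
print(rows_2006)
```

The trade-off is that this code is tied to SQLite, whereas the *sqlalchemy* version only needs a different connection string to target another database.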

### Google Drive

To access Google Drive you need a Google Cloud account and an app with access to the Drive API. That's beyond the scope of this course and everything has been taken care of (the result is the `client_secrets.json` file in the *configs* folder).

You'll probably need to install the pydrive package:

* if using anaconda: conda install -c conda-forge pydrive
* else: pip install pydrive

Before executing the cell below, here's what is going to happen:

1. A new tab will open asking for credentials. Use these ones:
    * email: pthnprjct@gmail.com
    * password: Tdm1ZWfpvSXg44H85h
2. Press *continue* if you get an alert about the app being in test mode
3. Allow access if asked for permissions
4. Everything is ok when you get the message *The authentication flow has completed.*
5. You can close the tab

In [None]:
# connect to google drive
from pydrive.auth import GoogleAuth

GoogleAuth.DEFAULT_SETTINGS['client_config_file'] = 'configs/client_secrets.json'
gauth = GoogleAuth()
gauth.LocalWebserverAuth()

In [None]:
# after the first time authenticating, you can save your credentials
gauth.SaveCredentialsFile("configs/drive_credentials.txt")

In [None]:
# so actually you can change the connection and authentication process to this (with the option to refresh an expired access token):
from pydrive.auth import GoogleAuth

GoogleAuth.DEFAULT_SETTINGS['client_config_file'] = 'configs/client_secrets.json'
gauth = GoogleAuth()
# try to load saved client credentials
gauth.LoadCredentialsFile("configs/drive_credentials.txt")

if gauth.credentials is None:
    # authenticate if they're not there
    gauth.LocalWebserverAuth()
elif gauth.access_token_expired:
    # refresh them if expired
    gauth.Refresh()
else:
    # initialize the saved credentials
    gauth.Authorize()

# save the current credentials to a file
gauth.SaveCredentialsFile("configs/drive_credentials.txt")

In [None]:
# create a local instance of your drive
from pydrive.drive import GoogleDrive

drive = GoogleDrive(gauth)

# show files (excluding folders and deleted files) in the drive
file_list = drive.ListFile({'q': 'mimeType != "application/vnd.google-apps.folder" and trashed=false'}).GetList()

for file in file_list:
    print(file['title'], file['id'])

In [None]:
# download files
wanted = ('inventories.zip', 'customer_list.xlsx')
for file in file_list:
    if file['title'] in wanted:
        # create a handle for the file from its id and download its contents
        drive_file = drive.CreateFile({'id': file['id']})
        drive_file.GetContentFile(f"data_sources/{file['title']}")
        print(f"Successfully downloaded {file['title']}")

### Excel files

The easiest way of reading Excel files is using the *pandas* package.

In [None]:
import pandas as pd

df_xlsx = pd.read_excel('data_sources/customer_list.xlsx')

df_xlsx.head()

### Zip files

In [None]:
from zipfile import ZipFile

with ZipFile('data_sources/inventories.zip', 'r') as zipObj:
    # extract all the contents of the zip file into the data_sources directory
    zipObj.extractall('data_sources')
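
Before extracting, it can be useful to look inside an archive first. The sketch below builds a tiny zip in memory and lists its members with `namelist()`; the file names mirror the inventory files mentioned earlier, but the column names and rows are invented for the example.

```python
import io
from zipfile import ZipFile

# Build a small archive in memory. The contents are made up for this sketch;
# only the file names follow the ones used in the project.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('inventory-store-1.csv', 'inventory_id;film_id\n1;10\n')
    archive.writestr('inventory-store-2.csv', 'inventory_id;film_id\n2;20\n')

# Reopen it for reading and inspect the members without extracting to disk.
buffer.seek(0)
with ZipFile(buffer, 'r') as archive:
    names = archive.namelist()
    first_file_text = archive.read(names[0]).decode('utf-8')

print(names)
print(first_file_text)
```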

### Web scraping

Scraping requires a minimal understanding of HTML, since we'll select portions of the page by navigating its HTML tags.
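
As a rough illustration of the tag structure involved, the sketch below parses a small, hypothetical table with the standard library's `html.parser` (not BeautifulSoup, which the cells in this section use). The sample HTML, the `TableParser` class and the column names are assumptions made for the example; the real pages may be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a title row, a header row (<th>) and data rows (<td>).
SAMPLE_HTML = """
<table>
  <tr><td colspan="2">Stores</td></tr>
  <tr><th>store_id</th><th>city</th></tr>
  <tr><td>1</td><td>Lisbon</td></tr>
  <tr><td>2</td><td>Porto</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every cell, grouped by row, together with its tag name."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of (tag, text) pairs per <tr>
        self._cell_tag = None   # 'th' or 'td' while the parser is inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('th', 'td'):
            self._cell_tag = tag
            self.rows[-1].append((tag, ''))

    def handle_data(self, data):
        # text outside a cell (indentation, newlines) is ignored
        if self._cell_tag is not None:
            tag, text = self.rows[-1][-1]
            self.rows[-1][-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._cell_tag = None


parser = TableParser()
parser.feed(SAMPLE_HTML)

# Row 0 is the title, row 1 holds the <th> headers, the rest are <td> data rows.
header = [text for tag, text in parser.rows[1]]
records = [[text for tag, text in row] for row in parser.rows[2:]]
print(header)
print(records)
```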

#### Stores

In [None]:
# Let's request the page and save the response:
from bs4 import BeautifulSoup as bsoup
import requests

request = requests.get('http://clevernet.pt/python_course/stores.html').text
response = bsoup(request, 'html5lib')
    
print(response.prettify())

In [None]:
# extract the table from the page
table = response.find('table')
print(table.prettify())

In [None]:
# get all the rows inside the table
rows = table.find_all('tr')
print(rows)

In [None]:
# we'll save the information into a dict
data = {}

# iterate the rows to get the data we want
for i, row in enumerate(rows):
    # we don't need the first row
    if i != 0:
        # if it's the second row, it's the headers, else it's content (we'll treat them differently)
        if i == 1:
            # get all the cells inside the row
            cells = row.find_all('th')
            # iterate the cells
            for cell in cells:
                data[cell.get_text()] = []
        else:
            # get all the cells inside the row
            cells = row.find_all('td')
            # iterate the cells
            for j, cell in enumerate(cells):
                dict_keys = list(data.keys())
                data[dict_keys[j]].append(cell.get_text())
print(data)        

In [None]:
# now we can easily send this to a pandas dataframe
df_stores = pd.DataFrame.from_dict(data)

df_stores

In [None]:
# let's export the dataframe to pickle, because we'll need it later on
# pickle is a great way of saving your dataframe without having to convert it to another type
df_stores.to_pickle('data_sources/stores.pkl')

#### Staff

This will be almost the same as the stores, so the code will be more compressed.

In [None]:
from bs4 import BeautifulSoup as bsoup
import requests

request = requests.get('http://clevernet.pt/python_course/staff.html').text
response = bsoup(request, 'html5lib')

table = response.find('table')
rows = table.find_all('tr')

data = {}
for i, row in enumerate(rows):
    if i != 0:
        if i == 1:
            cells = row.find_all('th')
            for cell in cells:
                data[cell.get_text()] = []
        else:
            cells = row.find_all('td')
            for j, cell in enumerate(cells):
                dict_keys = list(data.keys())
                data[dict_keys[j]].append(cell.get_text())

df_staff = pd.DataFrame.from_dict(data)

df_staff.to_pickle('data_sources/staff.pkl')

df_staff

### JSON files

The easiest way of reading JSON files is using the *pandas* package.

In [None]:
df_json = pd.read_json('data_sources/payments.json')

df_json.head()

That didn't work as expected! This happens quite often, because the records in a JSON file can be nested in many different ways and `read_json` cannot always guess the layout.

If you look at the raw json file you'll notice that all the records (each record is a dict) are the values of the key `Payments`. Furthermore, they are all inside a list.

So we have to do two things: 
* first, load the json file into a python dict using the *json* package (since it's a dict we can now access the part we're interested in)
* second, use `json_normalize` instead of `read_json`

In [None]:
import json

with open('data_sources/payments.json', 'r') as file:
    # json.load() parses the open file directly into a python dict
    data = json.load(file)
    
# notice that we're calling normalize not on the whole dict, but only on the values of the 'Payments' key
df_json = pd.json_normalize(data['Payments'])

df_json.head()
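
To make the nesting easier to picture, here is a minimal sketch with an inline JSON string of the same general shape: a single top-level key holding a list of records. The field names `payment_id` and `amount` are hypothetical and are not taken from `payments.json`.

```python
import json

# Hypothetical miniature of the file: one top-level key holding a list of records.
raw = '{"Payments": [{"payment_id": 1, "amount": 2.99}, {"payment_id": 2, "amount": 0.99}]}'

data = json.loads(raw)

# The top level is a dict with a single key, not a list of records...
top_level_keys = list(data.keys())
# ...and the records themselves are the list stored under that key.
records = data['Payments']

print(top_level_keys)
print(len(records), records[0])
```

Passing only the inner list to the normalising step is what turns each record into one row.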

### CSV files

The easiest way of reading CSV files is using the *pandas* package.

In [None]:
df_csv = pd.read_csv('data_sources/inventory-store-1.csv', delimiter=';')

df_csv.head()

Or you can use the *csv* package.

In [None]:
import csv

# newline='' is what the csv module recommends when opening files for csv.reader
with open('data_sources/inventory-store-1.csv', newline='') as file:
    csv_reader = csv.reader(file, delimiter=';')
    for i, line in enumerate(csv_reader):
        if i < 6:
            print(line)
        else:
            break
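
If you prefer rows keyed by column name rather than plain lists, `csv.DictReader` is one option. The sketch below reads from an in-memory string instead of the real file, and the column names in it are assumptions made for the example.

```python
import csv
import io

# Hypothetical semicolon-separated content standing in for the inventory file.
sample = io.StringIO('inventory_id;film_id;store_id\n1;10;1\n2;11;1\n')

# DictReader uses the first line as the header and yields one dict per row.
reader = csv.DictReader(sample, delimiter=';')
rows = list(reader)

print(reader.fieldnames)
print(rows[0])
```

Note that every value comes back as a string, so any type conversion is left to you.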

### FTP

FTP stands for File Transfer Protocol. It's as if we were accessing a hard drive in another computer, where we can browse the files and folders and upload/download what we want.

Python has built-in support for FTP through the *ftplib* package.

In [None]:
from ftplib import FTP, all_errors
# this is a local import. We're importing the file ftp_credentials.py from the configs directory.
from configs import ftp_credentials

# connect to ftp server
try:
    ftp = FTP(ftp_credentials.config['server'], ftp_credentials.config['user'], ftp_credentials.config['pass'])
    print('Connected successfully.')
except KeyError:
    print('Missing credentials.')
except all_errors as e:
    print(f'Error while trying to connect -> {e}')
except Exception as e:
    print(f'Error -> {e}')

In [None]:
# list files/folders in server
listing = ftp.nlst()
print(listing)

In [None]:
# we want to enter the rentals folder and download all the files into our own rentals folder, inside the data_sources folder
import os
import shutil

# if folder is empty
if len(listing) == 0:
    print('The folder is empty.')
# switch to the data_sources folder    
os.chdir('data_sources')   

# create or empty the rentals folder
if not os.path.exists('rentals'):
    os.mkdir('rentals')
else:
    shutil.rmtree('rentals') # this deletes the folder and all the files inside recursively
    os.mkdir('rentals')

# switch to the rentals folder on your disk
os.chdir('rentals')
# switch to the rentals folder on the server
ftp.cwd('/rentals')
# list files in server folder
file_listing = ftp.nlst()
print(file_listing)

In [None]:
# handle each file
for filename in file_listing:
    if filename.endswith('.txt'):
        file = os.path.join(os.getcwd(), filename)
        try:
            with open(file, 'wb') as local_file:
                ftp.retrbinary('RETR ' + filename, local_file.write)
            print(f'Successfully downloaded {filename}.')   
        except all_errors as e:
            print(f'Error while trying to download -> {e}')
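
One possible alternative, sketched below, is to pass the target folder explicitly instead of changing the working directory. The `download_txt_files` helper and the `FakeFTP` class are hypothetical names introduced for this example; `FakeFTP` only imitates the `retrbinary()` method of `ftplib.FTP` so that the sketch runs without a server.

```python
import os
import tempfile


def download_txt_files(ftp, filenames, target_dir):
    """Download every .txt file in filenames into target_dir and return the local paths."""
    os.makedirs(target_dir, exist_ok=True)
    downloaded = []
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        local_path = os.path.join(target_dir, filename)
        with open(local_path, 'wb') as local_file:
            # retrbinary calls local_file.write once per received block
            ftp.retrbinary('RETR ' + filename, local_file.write)
        downloaded.append(local_path)
    return downloaded


class FakeFTP:
    """Hypothetical stand-in for ftplib.FTP, used only to run the sketch offline."""

    def __init__(self, files):
        self.files = files

    def retrbinary(self, cmd, callback):
        # strip the 'RETR ' prefix to get the file name, then hand over its bytes
        callback(self.files[cmd[len('RETR '):]])


fake = FakeFTP({'rentals.txt': b'a|b\n1|2\n', 'readme.md': b'ignore me'})
with tempfile.TemporaryDirectory() as tmp:
    paths = download_txt_files(fake, ['rentals.txt', 'readme.md'], os.path.join(tmp, 'rentals'))
    downloaded_names = [os.path.basename(p) for p in paths]
    with open(paths[0], 'rb') as fh:
        content = fh.read()

print(downloaded_names)
```

With a real connection you would pass the `ftp` object and the listing returned by `ftp.nlst()` instead of the fake ones.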

### Off-topic: knowing which folder is your current working directory (cwd) is very important

In the cells above we changed our cwd with the `os.chdir()` function to make our work easier. But we cannot forget that such a change affects every cell below that deals with relative paths.

So it's a good practice to always change back to our base folder after using `chdir()`.

Also bear in mind that if you run the cell below more than once, you'll keep moving two levels up from wherever you currently are. If you wish to start over you have to restart the kernel.

In [None]:
# switch back to the main project folder, because the cells below are expecting you to be there
# this is where I am right now (the path before the 'course_project' folder will be different from yours obviously)
print(f'Before changing: {os.getcwd()}')
# To go back to the 'course_project' folder:
os.chdir('../..')
print(f'After changing: {os.getcwd()}')
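
A way to make the "always change back" habit automatic is a small context manager that restores the previous cwd on exit. The sketch below is one possible version; `working_directory` is a hypothetical helper name, and a temporary folder stands in for the project folders. Python 3.11 and later also ship `contextlib.chdir`, which does the same job.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the cwd to path, restoring the previous one on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # runs even if the body raises, so the cwd is always restored
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.getcwd()
after = os.getcwd()

print(inside != start, after == start)
```

Because the restore happens in `finally`, running such a cell several times does not move you further up the folder tree.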

### TXT files

You can read text files the exact same way as you read csv, using the *pandas* package.

In [None]:
df_txt = pd.read_csv('data_sources/rentals/rentals.txt', delimiter='|')

df_txt.head(20)

Not exactly the best outcome since we don't need the first line, but we'll focus on that in the next phase.

**Ok, so every data source is extracted. What can we still do to make our lives easier moving forward?**

Open *1.2-extraction-consolidation.ipynb* and let's move forward.