# Data Consolidation

Following up on the question from the previous notebook, here's what we decided to do:

* join all the text files and delete the unneeded row
* join the csv files

## Joining files

When joining files, it's important to check whether the headers appear in every file. In our case they don't in the txt files (only *rentals.txt* has them), but they do in the csv files.

We'll use two different ways: standard python for the txt files and *pandas* for the csv files.
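A quick way to check for headers is to peek at the first line of each file before joining. The sketch below is only an illustration: `first_lines` is a hypothetical helper, not something from the previous notebook, and it assumes plain text files that can be read with the default encoding.

```python
from pathlib import Path

def first_lines(paths):
    """Return a {file name: first line} mapping so the headers can be compared."""
    result = {}
    for path in paths:
        with open(path) as infile:
            # readline() only reads the first line, so this is cheap even for big files
            result[Path(path).name] = infile.readline().rstrip('\n')
    return result
```

If every file shows the same first line, each one carries its own header. If only one does, the files can be appended as they are, as long as that file comes first.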

### TXT files

In [None]:
from os import listdir
from os.path import abspath, isfile, join

# we're creating a list of all the files in the rentals folder (as absolute paths)
rentals_dir = abspath('data_sources/rentals')
txt_files = [join(rentals_dir, file) for file in listdir(rentals_dir) if isfile(join(rentals_dir, file))]

txt_files

In [None]:
# as you can see the files aren't sorted, and we need to parse the rentals.txt first because it's the one that contains the headers
txt_files = sorted(txt_files)
txt_files

It's obviously not a perfect sort, but it's enough for what we need.
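If the order of the remaining files mattered, a "natural" sort that compares the numbers inside the names as numbers would be one option. This is a minimal sketch: `natural_key` is a hypothetical helper and the file names are made up for illustration, since the real names depend on what's in the rentals folder.

```python
import re

def natural_key(name):
    """Split a name into text and number chunks so that 'file_10' sorts after 'file_2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]

# hypothetical file names, only for illustration
names = ['rentals_10.txt', 'rentals_2.txt', 'rentals.txt', 'rentals_1.txt']

plain = sorted(names)                     # plain string sort
natural = sorted(names, key=natural_key)  # natural sort

print(plain)    # ['rentals.txt', 'rentals_1.txt', 'rentals_10.txt', 'rentals_2.txt']
print(natural)  # ['rentals.txt', 'rentals_1.txt', 'rentals_2.txt', 'rentals_10.txt']
```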

In [None]:
# now we'll use that list to join all the files together
with open('data_sources/rentals_all.txt', 'w') as outfile:
    for file in txt_files:
        with open(file) as infile:
            outfile.write(infile.read())

# if the files are big and you can't place them all in memory, you'll have to read them line by line             
# with open('output_file', 'w') as outfile:
#    for file in filenames:
#        with open(file) as infile:
#            for line in infile:
#                outfile.write(line)
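
Another option for big files, sketched here as an alternative to the line-by-line loop, is `shutil.copyfileobj` from the standard library, which copies in fixed-size chunks. `join_files` is a hypothetical helper name, not part of the original notebook.

```python
import shutil

def join_files(sources, target):
    """Append every file in `sources` to `target`, copying chunk by chunk."""
    with open(target, 'wb') as outfile:
        for source in sources:
            with open(source, 'rb') as infile:
                # copyfileobj reads and writes fixed-size chunks, so memory use stays flat
                shutil.copyfileobj(infile, outfile)
```

In this notebook it could be called as `join_files(txt_files, 'data_sources/rentals_all.txt')`. Like the loops above, it assumes every file ends with a newline; otherwise the last row of one file and the first row of the next end up on the same line.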

In [None]:
# let's check if it went well
import pandas as pd

df_txt = pd.read_csv('data_sources/rentals/rentals.txt', delimiter='|')
print(f"Row count of just one txt file: {df_txt.shape[0]}")

df_rentals = pd.read_csv('data_sources/rentals_all.txt', delimiter='|')
print(f"Row count of all the txt files we parsed now: {df_rentals.shape[0]}")

So if each file has 1000 rows, to get 16045 rows we would need 17 files. Sounds about right!
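
The same back-of-the-envelope check can be written down in code. The 1000 rows per file is the assumption made above, not a measured value.

```python
import math

total_rows = 16045     # row count mentioned above for the joined file
rows_per_file = 1000   # assumed size of each file

# full files plus one partially filled file
files_needed = math.ceil(total_rows / rows_per_file)
print(files_needed)  # 17
```

In the notebook, `len(txt_files)` gives the real number of files to compare against.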

In [None]:
df_rentals.head()

In [None]:
# now we can delete that unneeded first row
df_rentals.drop(index=df_rentals.index[0], inplace=True)

df_rentals.head()

### CSV files

In [None]:
# load both csv's into dataframes
df_csv_1 = pd.read_csv('data_sources/inventory-store-1.csv', delimiter=';')
df_csv_2 = pd.read_csv('data_sources/inventory-store-2.csv', delimiter=';')

# add them to a list and call the concat() method
frames = [df_csv_1, df_csv_2]
df_csv_final = pd.concat(frames)

# verify if it's all good. Notice we're using the python function len(), an alternative to the shape property in pandas.
print(f"Row count of csv 1 + csv 2: {len(df_csv_1) + len(df_csv_2)}")
print(f"Row count of csv's combined: {df_csv_final.shape[0]}")

In [None]:
# export to a csv file
df_csv_final.to_csv('data_sources/inventory-store-all.csv', index=False)

## Deciding on an approach to build the data lake

Since we're dealing with a small amount of data and we can access it from python, we could end the process right here and assume that our data lake is ready for the next phase. But since the next phase comprises building a unified data model, we'll have to do a lot of joins. 

We could certainly use *pandas* for that, but with a little bit of SQL knowledge we can get the task done much more easily. Furthermore, we already have a database among our data sources, so why not capitalize on that and load the other data sources into it?
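
To give an idea of what such a join looks like, here's a minimal sketch using python's built-in `sqlite3` module and an in-memory database. The two tables and their columns are made up for illustration and aren't the real schema of *films.db*.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database, nothing is written to disk

# hypothetical tables, only for illustration
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    CREATE TABLE rental (rental_id INTEGER, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Rui');
    INSERT INTO rental VALUES (10, 1), (11, 1), (12, 2);
""")

# one statement joins both tables and counts the rentals per customer
rows = conn.execute("""
    SELECT c.name, COUNT(r.rental_id) AS rentals
    FROM customer AS c
    JOIN rental AS r ON r.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Ana', 2), ('Rui', 1)]
conn.close()
```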

Since it's an SQLite database, the fastest and easiest way to load records is by using csv files. We already have the following ready to go:

* the *films.db*, our database
* *inventory-store-all.csv*

So here's what we have to do next:

* convert the *customer_list.xlsx* to csv
* convert the *payments.json* to csv
* convert the *rentals_all.txt* to csv
* export the *staff* and *stores* data to csv

## Converting/Exporting data

### Excel, JSON and TXT files

Notice that we'll be using `index=False` when converting to csv since we don't need the index column that was automatically generated by *pandas*.
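
To see what `index=False` changes, here's a small sketch with a made-up dataframe. Calling `to_csv()` without a path returns the csv as a string, which makes the difference easy to print.

```python
import pandas as pd

# made-up data, only for illustration
df = pd.DataFrame({'name': ['Ana', 'Rui'], 'city': ['Porto', 'Faro']})

with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,city
print(without_index.splitlines()[0])  # name,city
```

The empty first column name in the first header is the index column we want to leave out.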

In [None]:
# excel - we're simply reusing the code from the previous notebook to create the dataframe
df_xlsx = pd.read_excel('data_sources/customer_list.xlsx')
df_xlsx.to_csv('data_sources/customer_list.csv', index=False)

# json - we're simply reusing the code from the previous notebook to create the dataframe
import json

with open('data_sources/payments.json', 'r') as file:
    data = json.load(file)
    
df_json = pd.json_normalize(data['Payments'])
df_json.to_csv('data_sources/payments.csv', index=False)

# txt - we're using the dataframe created some cells above
df_rentals.to_csv('data_sources/rentals_all.csv', index=False)

# check the data_sources folder for the created files

### Web scraped data

We'll use the pickle files created earlier to export the scraped data.

In [None]:
df_stores = pd.read_pickle('data_sources/stores.pkl')
df_stores.to_csv('data_sources/stores.csv', index=False)

df_staff = pd.read_pickle('data_sources/staff.pkl')
df_staff.to_csv('data_sources/staff.csv', index=False)

# check the data_sources folder for the created files

## Building the data lake

It's time to send all these csv's to the *films.db* to finally start inspecting the data.

In [None]:
# connect to the database
from sqlalchemy import create_engine

engine = create_engine('sqlite:///data_sources/films.db') 

engine.connect()

There are several ways to do this, and we'll use two of them. The first uses *pandas*; it's suitable when the csv is small, and we'll use it for half of our csv's:

In [None]:
# csv's to parse with pandas (the other half is imported with the SQLite app further down):
to_parse = {
    'store': 'stores.csv',
    'staff': 'staff.csv',
    'rental': 'rentals_all.csv'
}

for key, value in to_parse.items():
    # we're replacing any existing records in the table with the records from our csv
    pd.read_csv(f'data_sources/{value}').to_sql(key, engine, if_exists='replace', index=False)
    print(f'Successfully added {value} to database')

The second way uses the SQLite command-line app installed on your operating system directly. It bypasses python entirely, so it performs much better and is suitable for large files. We'll just use python to call the app, and that's it.

> __Note__: this might not work in Windows unless you use a unix-like shell such as Cygwin.

In [None]:
import os

db_path = os.path.join(os.getcwd(), 'data_sources', 'films.db')

# csv's to parse:
to_parse = {
    'payment': 'payments.csv',
    'inventory': 'inventory-store-all.csv',
    'customer': 'customer_list.csv'
}

for key, value in to_parse.items():
    csv_path = os.path.join(os.getcwd(), 'data_sources', value)

    # drop the table if it already exists, otherwise re-running this cell appends duplicate rows
    os.system(f'echo "drop table if exists {key};" | sqlite3 {db_path}')

    # import the csv (the table doesn't exist, so the first row provides the column names)
    import_cmd = f"(echo .separator ,; echo .import {csv_path} {key}) | sqlite3 {db_path}"
    os.system(import_cmd)
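
If the SQLite command-line app isn't available, a similar import can be sketched with python's built-in `csv` and `sqlite3` modules. This is an illustration under a few assumptions: `import_csv` is a hypothetical helper, the csv has a header row, the table and column names come from trusted input, and every column is created as TEXT.

```python
import csv
import sqlite3

def import_csv(db_path, table, csv_path):
    """Create `table` from the header of the csv and insert every remaining row."""
    with open(csv_path, newline='') as infile:
        reader = csv.reader(infile)
        header = next(reader)

        # names can't be passed as parameters, so they're quoted and placed in the statement
        columns = ', '.join(f'"{name}" TEXT' for name in header)
        placeholders = ', '.join('?' for _ in header)

        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(f'DROP TABLE IF EXISTS "{table}"')
                conn.execute(f'CREATE TABLE "{table}" ({columns})')
                conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
        finally:
            conn.close()
```

For example, `import_csv('data_sources/films.db', 'payment', 'data_sources/payments.csv')` would load the payments. It still goes through python, so it won't be as fast as the command-line import for large files.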

## Quick data lake inspection

In [None]:
from sqlalchemy import MetaData
from sqlalchemy.orm import sessionmaker

metadata = MetaData()
metadata.reflect(engine)

Session = sessionmaker(bind=engine)
session = Session()

for table in metadata.tables:
    table_obj = metadata.tables[table]
    query = session.query(table_obj)
    print(f'Information for table: {table} - {query.count()} rows')
    print('-' * 40)
    for col in table_obj.columns:
        print(f"{col.name} - {col.type}")
    print()

**Notice any difference between the tables imported with *pandas* and the other ones?**