## Data Engineering Capstone Project


### Step 2. Explore and Assess the Data

The purpose of this notebook is to read in the relevant data and assess the following attributes of each data source:

* Data schema.
* Size of each data source.
* Quality of each data source.

As described in the README file, we will read each data source into a pandas data frame and then analyse these attributes. Pandas was chosen for reading the data because it is easy to use with Airflow.
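The three attributes listed above can be gathered with a small helper. The following is only a sketch: `summarise_dataframe` and `example_df` are hypothetical names that do not appear elsewhere in this notebook, and the example frame contains made-up values.

```python
import pandas as pd


def summarise_dataframe(df: pd.DataFrame) -> dict:
    """Collect basic schema, size, and quality figures for a data frame."""
    return {
        # schema: column name -> dtype as a string
        'schema': {col: str(dtype) for col, dtype in df.dtypes.items()},
        # size: number of rows and columns
        'n_rows': len(df),
        'n_columns': df.shape[1],
        # quality: missing values per column and fully duplicated rows
        'null_counts': df.isna().sum().to_dict(),
        'n_duplicate_rows': int(df.duplicated().sum()),
    }


# tiny made-up frame, used only to illustrate the helper
example_df = pd.DataFrame({'id': [1, 2, 2, 4],
                           'value': [0.5, 1.5, 1.5, None]})
summary = summarise_dataframe(example_df)
print(summary)
```

The same helper could be called on each data frame loaded below, which would keep the per-source checks consistent.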

In [None]:
import pandas as pd
import os
import datetime

#### Immigration Dataset

The immigration dataset is stored as a series of parquet files in `data/immigration-data/`. We are going to read them in using pandas and analyse the schema.

In [None]:
## read in the parquet files from the directory
data_directory = 'data/immigration-data'
data_files = [os.path.join(data_directory, f) for f in os.listdir(data_directory)]

dfs = []

for f in data_files:
    _df = pd.read_parquet(f)
    dfs.append(_df)

immigration_data = pd.concat(dfs)

In [None]:
## get the immigration data columns
immigration_data.columns

In [None]:
## get the first 10 rows
immigration_data.head(10)

In [None]:
## get the full length
len(immigration_data)

In [None]:
immigration_data = immigration_data.drop_duplicates('admnum')
len(immigration_data)

In [None]:
## inspect the timestamp columns
def show_ts_columns(df):
    print(df[['arrdate', 'depdate', 'dtaddto']].head(50))
    
show_ts_columns(immigration_data)

In [None]:
## convert a SAS date (days since 1960-01-01) to a pandas timestamp
def convert_sas_timestamp(column_name, df):
    df[column_name] = pd.to_timedelta(df[column_name], unit='D') + pd.Timestamp('1960-1-1')
    return df
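To check the conversion, we can apply the same arithmetic to a few known values. This is a standalone sketch: the `sas_days` values are made up for illustration, and `SAS_EPOCH` is a hypothetical name for the 1960-01-01 reference date used in the function above.

```python
import pandas as pd

# SAS stores dates as a number of days since 1 January 1960
SAS_EPOCH = pd.Timestamp('1960-01-01')

# made-up SAS date values; NaN stands in for a missing date
sas_days = pd.Series([0.0, 366.0, float('nan')])

# same arithmetic as convert_sas_timestamp, applied to a bare Series
converted = pd.to_timedelta(sas_days, unit='D') + SAS_EPOCH
print(converted)
```

A value of 0 maps to the epoch itself, 366 maps to 1961-01-01 because 1960 is a leap year, and the missing value becomes `NaT` rather than raising an error.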

In [None]:
## convert arrival date and departure dates 
immigration_data = convert_sas_timestamp('arrdate', immigration_data)
immigration_data = convert_sas_timestamp('depdate', immigration_data)

show_ts_columns(immigration_data)

In [None]:
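## keep only rows where dtaddto is an 8-character string (MMDDYYYY);
## .copy() avoids a SettingWithCopyWarning when the column is reassigned later
immigration_data = immigration_data[immigration_data['dtaddto'].str.len() == 8].copy()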

In [None]:
## convert dtaddto to a datetime; values that cannot be parsed become NaT
immigration_data['dtaddto'] = pd.to_datetime(immigration_data['dtaddto'], format="%m%d%Y", errors='coerce')

In [None]:
show_ts_columns(immigration_data)
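Because the length filter drops rows and `errors='coerce'` silently turns bad values into `NaT`, it can be useful to count both as part of the quality assessment. Below is a minimal sketch on made-up values; `raw`, `n_dropped`, and `n_unparseable` are hypothetical names, and the strings are not taken from the real dataset.

```python
import pandas as pd

# made-up stand-ins for the dtaddto column
raw = pd.Series(['10282016', '13012016', 'n/a', None])

# step 1: the length filter used above
has_valid_length = raw.str.len() == 8
n_dropped = int(len(raw) - has_valid_length.sum())

# step 2: parse as MMDDYYYY; impossible dates (month 13 here) become NaT
parsed = pd.to_datetime(raw[has_valid_length], format='%m%d%Y', errors='coerce')
n_unparseable = int(parsed.isna().sum())

print(f'dropped by length filter: {n_dropped}')
print(f'coerced to NaT: {n_unparseable}')
```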

#### Temperature Data

The temperature data is divided into five csv files:

* GlobalTemperatures.csv
* GlobalLandTemperaturesByCity.csv
* GlobalLandTemperaturesByCountry.csv
* GlobalLandTemperaturesByMajorCity.csv
* GlobalLandTemperaturesByState.csv

For each csv file, we will read it in using pandas, get the schema, print the first 10 rows, and display the row count.

In [None]:
## base path for the csv files
base_path = './data/climate-change'

## list of the files (os and pandas are already imported at the top of the notebook)
file_names = ['GlobalTemperatures', 
#               'GlobalLandTemperaturesByCity', 
              'GlobalLandTemperaturesByCountry',
              'GlobalLandTemperaturesByMajorCity',
              'GlobalLandTemperaturesByState']

for data_source in file_names:
    data_dest = os.path.join(base_path, f'{data_source}.csv')
    print(f'== Analysing Data Source:: {data_source} :: File Path :: {data_dest} ==')
          
    data_df = pd.read_csv(data_dest)
            
    ## print the schema
    print('\n** SCHEMA **\n')
    print(list(data_df))
    print()

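    ## rename the columns only when the file has exactly three columns;
    ## assigning three names to a wider frame raises a ValueError
    new_columns = ['ts',
                   'average_temperature',
                   'country_code']
    if len(data_df.columns) == len(new_columns):
        data_df.columns = new_columns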

    ## get the first 10 rows
    print('\n** FIRST 10 ROWS **\n')
    print(data_df.head(10))
    print()

    ## get the count
    print('\n** NUMBER OF ROWS **\n')
    print(len(data_df))
    print()

    

#### Demographics

The demographics dataset contains information about the demographics of all US cities. We will read in the csv file using pandas and get the schema, the first 10 rows, and the row count.

In [None]:
file_path = './data/demographics/us-cities-demographics.csv'

demographics_df = pd.read_csv(file_path, delimiter=";")

## get the schema
print('\n** SCHEMA **\n')
print(list(demographics_df))
print()

## rename the columns

print(f'Before::{demographics_df.columns}')

df_cols = ['city',
           'state',
           'median_age',
           'male_population',
           'female_population',
           'total_population',
           'number_of_veterans',
           'foreign_born',
           'average_household_size',
           'state_code',
           'race',
           'count']

print(f'NewColumns::{df_cols}::{len(df_cols)}')

demographics_df.columns = df_cols

print(f'After::{demographics_df.columns}')

## get the first 10 rows
print('\n** FIRST 10 ROWS **\n')
print(demographics_df.head(10))
print()

## get the row count
print('\n** ROW COUNT **\n')
print(len(demographics_df))
print()
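With the columns renamed, one further quality check is whether any row repeats the same combination of identifying columns. The sketch below assumes that `city`, `state`, and `race` together identify a row. That is an assumption about the dataset, not something established above. The frame `demo` holds made-up values.

```python
import pandas as pd

# made-up rows using a subset of the renamed columns
demo = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Shelbyville'],
    'state': ['Example State'] * 4,
    'race': ['White', 'Asian', 'White', 'White'],
    'count': [100, 20, 50, 50],
})

# assumed key: one row per city, state, and race
key = ['city', 'state', 'race']

# rows whose key has already appeared earlier in the frame
n_duplicate_keys = int(demo.duplicated(subset=key).sum())
print(f'duplicate keys: {n_duplicate_keys}')
```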

#### Airport Codes

The airport codes dataset contains airport codes and their corresponding cities.

We will read in the `.csv` file using pandas, get the schema, the first 10 rows, and the length of the dataset.

In [None]:
file_path = './data/airport-codes/airport-codes_csv.csv'

airport_codes_df = pd.read_csv(file_path)

print('\n** SCHEMA **\n')
print(list(airport_codes_df))
print()

print('\n** FIRST 10 ROWS **\n')
print(airport_codes_df.head(10))
print()

print('\n** ROW COUNT **\n')
print(len(airport_codes_df))
print()