## Data Engineering Capstone Project


### Step 2. Explore and Assess the Data

The purpose of this notebook is to read in the relevant data and assess the following attributes of each data source:

* Data schema.
* Size of each data source.
* Quality of each data source.

As described in the README file, we read each data source into a data frame and then analyse its attributes. The immigration data is read with Spark so that the same approach still works if the data grows significantly; the smaller CSV sources further down are read with pandas. Reading the `.SAS` file required the following [plugin](https://spark-packages.org/package/saurfang/spark-sas7bdat).
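A minimal sketch of how the plugin might be wired into a Spark session. The package coordinates and version below are assumptions (they are not stated in this notebook) and must match your Spark and Scala build; `build_session` and `read_sas` are hypothetical helper names used only for illustration.

```python
# Assumed Maven coordinates for the spark-sas7bdat plugin; adjust the
# version suffix to match your Spark/Scala build.
SAS_PACKAGE = "saurfang:spark-sas7bdat:2.0.0-s_2.11"
# Data source name registered by the plugin.
SAS_FORMAT = "com.github.saurfang.sas.spark"


def build_session(package=SAS_PACKAGE):
    """Create (or reuse) a SparkSession that pulls in the SAS reader plugin."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder
        .config("spark.jars.packages", package)
        .getOrCreate()
    )


def read_sas(spark, path):
    """Read a .sas7bdat file into a Spark DataFrame via the plugin."""
    return spark.read.format(SAS_FORMAT).load(path)
```

Note that `spark.jars.packages` only takes effect when the session is first created, so it has to be set before any other cell calls `getOrCreate()`.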

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
                     getOrCreate()

#### Immigration Dataset

The immigration dataset is stored as a series of parquet files in `data/immigration-data/`. We read them in using Spark and analyse the schema.

In [9]:
	
## read in the parquet files from the directory
immigration_df = spark.read.parquet('./data/immigration-data/')

In [10]:
## create a temporary view
immigration_df.createOrReplaceTempView('immigration')

In [11]:
## get the schema
immigration_df.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [12]:
## get the first 10 rows
spark.sql("select * from immigration limit 10").show()

+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|    cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|        admnum|fltno|visatype|
+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|5748517.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     CA|20582.0|  40.0|    1.0|  1.0|20160430|     SYD| null|      G|      O|   null|      M| 1976.0|10292016|     F|  null|     QF|9.495387003E10|00011|      B1|
|5748518.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     NV|20591.0|  32.0|    1.0|  

In [13]:
## get the size of the dataset
spark.sql('select count(*) from immigration').show()

+--------+
|count(1)|
+--------+
| 3096313|
+--------+
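One quality observation from the sample rows: `arrdate` and `depdate` are stored as doubles such as `20574.0` rather than as dates. Assuming these are SAS numeric dates (a day count from 1 January 1960, which is how SAS represents dates), a conversion could be sketched as follows. The helper name `sas_to_date` is hypothetical.

```python
from datetime import date, timedelta

# SAS stores dates as a day count from 1 January 1960.
SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(days):
    """Convert a SAS numeric date to a datetime.date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


# arrdate of the first sample row shown above
print(sas_to_date(20574.0))  # 2016-04-30
```

The result coincides with the `dtadfile` value `20160430` in the same row, which supports the assumption about the encoding.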



#### Temperature Data

The temperature data is divided into five CSV files:

* GlobalTemperatures.csv
* GlobalLandTemperaturesByCity.csv
* GlobalLandTemperaturesByCountry.csv
* GlobalLandTemperaturesByMajorCity.csv
* GlobalLandTemperaturesByState.csv

For each of the CSV files, we read it in using pandas, get the schema, print the first 10 rows, and display the row count.

In [33]:
## base path for the csv files
base_path = './data/climate-change'

## list of the files
import os
import pandas as pd

file_names = ['GlobalTemperatures', 
              'GlobalLandTemperaturesByCity', 
              'GlobalLandTemperaturesByCountry',
              'GlobalLandTemperaturesByMajorCity',
              'GlobalLandTemperaturesByState']

for data_source in file_names:
    data_dest = os.path.join(base_path, f'{data_source}.csv')
    print(f'== Analysing Data Source:: {data_source} :: File Path :: {data_dest} ==')
          
    data_df = pd.read_csv(data_dest)
        
    ## print the schema
    print('\n** SCHEMA **\n')
    print(list(data_df))
    print()
    
    ## get the first 10 rows
    print('\n** FIRST 10 ROWS **\n')
    print(data_df.head(10))
    print()
    
    ## get the count
    print('\n** NUMBER OF ROWS **\n')
    print(len(data_df))
    print()

== Analysing Data Source:: GlobalTemperatures :: File Path :: ./data/climate-change/GlobalTemperatures.csv ==

** SCHEMA **

['dt', 'LandAverageTemperature', 'LandAverageTemperatureUncertainty', 'LandMaxTemperature', 'LandMaxTemperatureUncertainty', 'LandMinTemperature', 'LandMinTemperatureUncertainty', 'LandAndOceanAverageTemperature', 'LandAndOceanAverageTemperatureUncertainty']


** FIRST 10 ROWS **

           dt  LandAverageTemperature  LandAverageTemperatureUncertainty  \
0  1750-01-01                   3.034                              3.574   
1  1750-02-01                   3.083                              3.702   
2  1750-03-01                   5.626                              3.076   
3  1750-04-01                   8.490                              2.451   
4  1750-05-01                  11.573                              2.072   
5  1750-06-01                  12.937                              1.724   
6  1750-07-01                  15.868                        
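The loop above prints column names only, so it does not show column types or missing values. A minimal sketch of a per-column summary that could complement it is given below; `summarise` is a hypothetical helper, and the two-row sample is illustrative only (the blank second value is made up to show a null).

```python
import io

import pandas as pd


def summarise(df):
    """Return per-column dtype, null count, and null share for a DataFrame."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_share": df.isna().mean().round(3),
    })


# Tiny illustrative sample; the blank temperature is a made-up missing value.
sample = pd.read_csv(
    io.StringIO(
        "dt,LandAverageTemperature\n"
        "1750-01-01,3.034\n"
        "1750-02-01,\n"
    ),
    parse_dates=["dt"],
)

print(summarise(sample))
```

Inside the loop, this would amount to calling `summarise(data_df)` for each file.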

#### Demographics

The demographics dataset contains demographic information about US cities. We will read in the CSV file using pandas and get the schema, first 10 rows, and the row count.

In [34]:
file_path = './data/demographics/us-cities-demographics.csv'

demographics_df = pd.read_csv(file_path)

## get the schema
print('\n** SCHEMA **\n')
print(list(demographics_df))
print()

## get the first 10 rows
print('\n** FIRST 10 ROWS **\n')
print(demographics_df.head(10))
print()

## get the row count
print('\n** ROW COUNT **\n')
print(len(demographics_df))
print()


** SCHEMA **

['City;State;Median Age;Male Population;Female Population;Total Population;Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count']


** FIRST 10 ROWS **

  City;State;Median Age;Male Population;Female Population;Total Population;Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count
0  Los Angeles;California;35.0;1958998;2012898;39...                                                                                                   
1  Metairie;Louisiana;41.6;69515;76943;146458;718...                                                                                                   
2  Boca Raton;Florida;47.3;44760;48466;93226;4367...                                                                                                   
3  Quincy;Massachusetts;41.0;44129;49500;93629;41...                                                                                                   
4  Union City;California;38.5;38599;35911;74510;
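The output above shows a single column whose name contains every field, which suggests the file is semicolon-delimited and was parsed with the default comma separator. The sketch below reproduces the effect on an in-memory sample and shows the `sep=';'` fix; the header is copied from the output above, while the data row is a made-up placeholder.

```python
import io

import pandas as pd

# Header copied from the output above; the data row is a made-up placeholder.
raw = (
    "City;State;Median Age;Male Population;Female Population;Total Population;"
    "Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count\n"
    "Example City;Example State;30.0;100;110;210;10;20;2.5;EX;Example Race;50\n"
)

# Default comma separator: the whole header lands in one column.
as_commas = pd.read_csv(io.StringIO(raw))
# Explicit semicolon separator: one column per field.
as_semicolons = pd.read_csv(io.StringIO(raw), sep=";")

print(as_commas.shape, as_semicolons.shape)  # (1, 1) (1, 12)
```

For this notebook, that would correspond to `pd.read_csv(file_path, sep=';')`.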

#### Airport Codes

The airport codes dataset contains airport codes and their corresponding cities.

We will read in the `.csv` file using pandas, get the schema, the first 10 rows, and the length of the dataset.

In [35]:
file_path = './data/airport-codes/airport-codes_csv.csv'

airport_codes_df = pd.read_csv(file_path)

print('\n** SCHEMA **\n')
print(list(airport_codes_df))
print()

print('\n** FIRST 10 ROWS **\n')
print(airport_codes_df.head(10))
print()

## get the row count
print('\n** ROW COUNT **\n')
print(len(airport_codes_df))
print()


** SCHEMA **

['ident', 'type', 'name', 'elevation_ft', 'continent', 'iso_country', 'iso_region', 'municipality', 'gps_code', 'iata_code', 'local_code', 'coordinates']


** FIRST 10 ROWS **

  ident           type                                name  elevation_ft  \
0   00A       heliport                   Total Rf Heliport          11.0   
1  00AA  small_airport                Aero B Ranch Airport        3435.0   
2  00AK  small_airport                        Lowell Field         450.0   
3  00AL  small_airport                        Epps Airpark         820.0   
4  00AR         closed  Newport Hospital & Clinic Heliport         237.0   
5  00AS  small_airport                      Fulton Airport        1100.0   
6  00AZ  small_airport                      Cordes Airport        3810.0   
7  00CA  small_airport             Goldstone /Gts/ Airport        3038.0   
8  00CL  small_airport                 Williams Ag Airport          87.0   
9  00CN       heliport     Kitchen Creek Helibas