### Objective

Concatenate and clean the data before loading it into the MySQL database.

### Tasks

1. Combine each of the individual weather station CSVs into a single CSV for each calendar year (e.g. 2000, 2001)
2. Remove redundant columns
3. Rename the columns to match those of the database table for easier import

### Functions

In [7]:
# Import required libraries
import csv
import os
import pandas as pd
from sqlalchemy import create_engine

In [8]:
# Combine all station CSVs for one year into a single DataFrame
def create_annual_file(folder_path, year):
    '''
        For all of the CSV files in a given folder, combine all of the CSVs into a single annual file
        
        folder_path = String. The folder path where each of the annual folders are saved
        year = String. The calendar year file being created.
        
        returns a DataFrame
    '''
    # Full path to the folder for this year
    full_path = os.path.join(folder_path, year)
    
    # Get the CSV file names in the dir (skips hidden files such as .DS_Store)
    file_names = [f for f in os.listdir(full_path) if f.lower().endswith(".csv")]
    
    # Combine the CSVs
    combined_file = pd.concat([pd.read_csv(os.path.join(full_path, f)) for f in file_names])
    
    # Return combined_csv df
    return combined_file

In [9]:
# Clean the CSV
def clean_files(df):
    '''
        Clean combined CSV ready for import into MySQL
    
        df = DataFrame returned from the create_annual_file function.
        
        returns a DataFrame
    '''
    
    # Drop the columns that are not stored in the StationRecords table
    df = df.drop(labels = ['LATITUDE', 
                           'LONGITUDE', 
                           'NAME', 
                           'ELEVATION', 
                           'PRCP_ATTRIBUTES', 
                           'TEMP_ATTRIBUTES', 
                           'DEWP_ATTRIBUTES', 
                           'SLP_ATTRIBUTES', 
                           'STP_ATTRIBUTES', 
                           'VISIB_ATTRIBUTES', 
                           'WDSP_ATTRIBUTES', 
                           'MAX_ATTRIBUTES', 
                           'MIN_ATTRIBUTES'], 
                 axis=1)
    
    # Rename the columns 
    df.columns = ['StationId', 
                  'Date', 
                  'Temp', 
                  'Dew', 
                  'SLP', 
                  'StationPressure', 
                  'Visib', 
                  'WindSpeed', 
                  'MaxWindSpeed', 
                  'Gust', 
                  'MaxTemp', 
                  'MinTemp', 
                  'Precip', 
                  'SnowDepth', 
                  'Conditions']

    return df
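
A positional rename like the one above depends on the raw columns arriving in exactly the expected order. As a hedged alternative, the sketch below selects and renames by name instead. The raw header names in `COLUMN_MAP` (`STATION`, `DATE`, `MXSPD`, `FRSHTT` and so on) are assumptions based on the standard GSOD CSV layout, since the notebook never prints them, and `clean_by_name` is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# ASSUMPTION: raw GSOD header names; the notebook does not show them.
COLUMN_MAP = {
    "STATION": "StationId",
    "DATE": "Date",
    "TEMP": "Temp",
    "DEWP": "Dew",
    "SLP": "SLP",
    "STP": "StationPressure",
    "VISIB": "Visib",
    "WDSP": "WindSpeed",
    "MXSPD": "MaxWindSpeed",
    "GUST": "Gust",
    "MAX": "MaxTemp",
    "MIN": "MinTemp",
    "PRCP": "Precip",
    "SNDP": "SnowDepth",
    "FRSHTT": "Conditions",
}


def clean_by_name(df, column_map=COLUMN_MAP):
    """Keep only the mapped columns and rename them to the database names."""
    missing = [col for col in column_map if col not in df.columns]
    if missing:
        raise KeyError("Missing expected columns: {}".format(missing))
    # Selecting by name drops every other column and fixes the output order
    return df[list(column_map)].rename(columns=column_map)


# Small demonstration with made-up values and the raw columns in reverse order
raw = pd.DataFrame({col: [0] for col in reversed(list(COLUMN_MAP))})
raw["NAME"] = ["EXAMPLE STATION"]
cleaned = clean_by_name(raw)
```

Because the columns are matched by name, a file with an unexpected layout raises a `KeyError` instead of silently loading values into the wrong database column.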

In [10]:
# Run all functions and load results to MySQL
def run(folder_path, start_year, end_year):
    '''
        Combines the individual weather station records into one file per year.
        Cleans each annual file.
        Loads each annual file to MySQL.
        
        folder_path = String. Path to where all of the folders for each year are saved.
        start_year = Int. First year to combine (inclusive)
        end_year = Int. Last year to combine (inclusive)
        
        returns None
    '''    
    # Create range of years
    years = range(start_year, end_year + 1)
    
    for year in years:
        # Create annual file
        print("Creating annual file for {}...".format(year))
        df = create_annual_file(folder_path, str(year))
        print("{} file created".format(year))

        # Clean annual file
        print("Cleaning {} file...".format(year))
        df = clean_files(df)
        print("{} file cleaned".format(year))    

        # Load annual file to MySQL
        print("Loading {} file to MySQL...".format(year))
        
        # Create connection to the database
        engine = create_engine('mysql+mysqldb://root:@localhost/globalwarming')

        df.to_sql(con=engine,
                  name='StationRecords',
                  if_exists='append',
                  index=False)
        
        print("{} file loaded to MySQL".format(year))
        print("-----------------------")

    print("Loaded all years to MySQL")
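
`run` passes each year's DataFrame to `to_sql` in a single call. pandas also accepts a `chunksize` argument that writes the rows in batches, which may keep individual INSERT statements smaller. The sketch below is illustrative only: it uses an in-memory SQLite connection from the standard library in place of the MySQL engine, and `load_in_chunks` and the demo values are hypothetical.

```python
import sqlite3

import pandas as pd


def load_in_chunks(df, con, table="StationRecords", chunksize=10000):
    """Append df to the table, writing at most `chunksize` rows per batch."""
    df.to_sql(name=table,
              con=con,
              if_exists="append",
              index=False,
              chunksize=chunksize)


# Demonstration against an in-memory SQLite database (stand-in for MySQL)
con = sqlite3.connect(":memory:")
demo = pd.DataFrame({
    "StationId": [1, 1, 2],
    "Date": ["1985-01-01", "1985-01-02", "1985-01-01"],
    "Temp": [42.6, 42.6, 87.6],
})
load_in_chunks(demo, con, chunksize=2)
row_count = con.execute("SELECT COUNT(*) FROM StationRecords").fetchone()[0]
```

With the notebook's SQLAlchemy engine, the same `chunksize` argument could be passed to the existing `df.to_sql` call; a suitable batch size would need to be found by experiment.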

### Run Function

In [71]:
run("/Users/todddequincey/Downloads/", 2000, 2019)

Creating annual file for 2000...
2000 file created
Cleaning 2000 file...
2000 file cleaned
Loading 2000 file to MySQL...
2000 file loaded to MySQL
-----------------------
Creating annual file for 2001...
2001 file created
Cleaning 2001 file...
2001 file cleaned
Loading 2001 file to MySQL...
2001 file loaded to MySQL
-----------------------
Creating annual file for 2002...
2002 file created
Cleaning 2002 file...
2002 file cleaned
Loading 2002 file to MySQL...
2002 file loaded to MySQL
-----------------------
Creating annual file for 2003...
2003 file created
Cleaning 2003 file...
2003 file cleaned
Loading 2003 file to MySQL...
2003 file loaded to MySQL
-----------------------
Creating annual file for 2004...
2004 file created
Cleaning 2004 file...
2004 file cleaned
Loading 2004 file to MySQL...
2004 file loaded to MySQL
-----------------------
Creating annual file for 2005...
2005 file created
Cleaning 2005 file...
2005 file cleaned
Loading 2005 file to MySQL...
2005 file loaded to MySQL

In [78]:
run("/Users/todddequincey/Downloads/GSOD/", 1995, 1999)

Creating annual file for 1995...
1995 file created
Cleaning 1995 file...
1995 file cleaned
Loading 1995 file to MySQL...
1995 file loaded to MySQL
-----------------------
Creating annual file for 1996...
1996 file created
Cleaning 1996 file...
1996 file cleaned
Loading 1996 file to MySQL...
1996 file loaded to MySQL
-----------------------
Creating annual file for 1997...
1997 file created
Cleaning 1997 file...
1997 file cleaned
Loading 1997 file to MySQL...
1997 file loaded to MySQL
-----------------------
Creating annual file for 1998...
1998 file created
Cleaning 1998 file...
1998 file cleaned
Loading 1998 file to MySQL...
1998 file loaded to MySQL
-----------------------
Creating annual file for 1999...
1999 file created
Cleaning 1999 file...
1999 file cleaned
Loading 1999 file to MySQL...
1999 file loaded to MySQL
-----------------------
Loaded all years to MySQL


In [79]:
run("/Users/todddequincey/Downloads/GSOD/", 1990, 1991)

Creating annual file for 1990...
1990 file created
Cleaning 1990 file...
1990 file cleaned
Loading 1990 file to MySQL...
1990 file loaded to MySQL
-----------------------
Creating annual file for 1991...
1991 file created
Cleaning 1991 file...
1991 file cleaned
Loading 1991 file to MySQL...
1991 file loaded to MySQL
-----------------------
Creating annual file for 1992...


FileNotFoundError: [Errno 2] No such file or directory: '/Users/todddequincey/Downloads/GSOD/1992'
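
The `FileNotFoundError` above is raised as soon as a year folder is missing, which also stops any remaining years from being processed. One possible guard, sketched under the assumption that each year lives in a sub-folder named after the year (as in `create_annual_file`), is to check which folders exist before starting. `existing_years` is a hypothetical helper.

```python
import os
import tempfile


def existing_years(folder_path, start_year, end_year):
    """Return the years in the inclusive range that have a folder under folder_path."""
    return [year for year in range(start_year, end_year + 1)
            if os.path.isdir(os.path.join(folder_path, str(year)))]


# Demonstration in a temporary folder that only contains 1990 and 1991
with tempfile.TemporaryDirectory() as tmp:
    for year in (1990, 1991):
        os.mkdir(os.path.join(tmp, str(year)))
    found = existing_years(tmp, 1990, 1992)
```

The returned list can be compared with the requested range to report which years still need to be downloaded before calling `run`.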

In [80]:
run("/Users/todddequincey/Downloads/GSOD/", 1992, 1994)

Creating annual file for 1992...
1992 file created
Cleaning 1992 file...
1992 file cleaned
Loading 1992 file to MySQL...
1992 file loaded to MySQL
-----------------------
Creating annual file for 1993...
1993 file created
Cleaning 1993 file...
1993 file cleaned
Loading 1993 file to MySQL...
1993 file loaded to MySQL
-----------------------
Creating annual file for 1994...
1994 file created
Cleaning 1994 file...
1994 file cleaned
Loading 1994 file to MySQL...
1994 file loaded to MySQL
-----------------------
Loaded all years to MySQL


In [12]:
run("/Users/todddequincey/Downloads/GSOD/", 1985, 1989)

Creating annual file for 1985...
1985 file created
Cleaning 1985 file...
1985 file cleaned
Loading 1985 file to MySQL...


OperationalError: (_mysql_exceptions.OperationalError) (1292, "Incorrect date value: '01/01/1985' for column 'Date' at row 596") [SQL: 'INSERT INTO `StationRecords` (`StationId`, `Date`, `Temp`, `Dew`, `SLP`, `StationPressure`, `Visib`, `WindSpeed`, `MaxWindSpeed`, `Gust`, `MaxTemp`, `MinTemp`, `Precip`, `SnowDepth`, `Conditions`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: ((57602099999, '1985-01-01', 42.6, 39.2, 1023.9, 999.9, 0.7, 1.3, 3.9, 999.9, 43.3, 41.7, 0.0, 999.9, 110000), (57602099999, '1985-01-02', 42.6, 39.1, 1024.5, 999.9, 0.9, 1.6, 3.9, 999.9, 44.1, 41.2, 0.01, 999.9, 10000), (57602099999, '1985-01-03', 42.5, 40.1, 1028.0, 999.9, 1.2, 0.4, 1.9, 999.9, 44.4, 40.5, 0.0, 999.9, 10000), (57602099999, '1985-01-04', 42.8, 40.9, 1027.3, 999.9, 1.1, 1.0, 3.9, 999.9, 44.6, 40.5, 0.04, 999.9, 110000), (57602099999, '1985-01-05', 43.5, 39.4, 1023.3, 999.9, 1.4, 1.6, 3.9, 999.9, 45.9, 41.2, 0.0, 999.9, 0), (57602099999, '1985-01-06', 42.0, 40.2, 1024.3, 999.9, 1.0, 1.0, 1.9, 999.9, 42.6, 40.6, 0.05, 999.9, 10000), (57602099999, '1985-01-07', 41.1, 40.5, 1027.8, 999.9, 1.1, 4.3, 7.8, 999.9, 42.8, 39.2, 0.13, 999.9, 10000), (57602099999, '1985-01-08', 40.2, 38.5, 1027.0, 999.9, 1.7, 3.2, 5.8, 999.9, 41.5, 39.0, 0.05, 999.9, 10000)  ... displaying 10 of 2386296 total bound parameter sets ...  (78520199999, '1985-07-10', 87.6, 73.2, 9999.9, 999.9, 19.9, 12.2, 17.1, 999.9, 89.1, 84.9, 0.0, 999.9, 0), (78520199999, '1985-07-11', 89.0, 72.0, 9999.9, 999.9, 19.9, 14.4, 15.0, 999.9, 90.0, 88.0, 0.0, 999.9, 0))] (Background on this error at: http://sqlalche.me/e/e3q8)
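
The error reports the value `'01/01/1985'`, which suggests that at least some of the 1985 files store dates in a slash-separated format rather than `YYYY-MM-DD`. A minimal sketch of normalising the column before loading follows. It assumes the slash format is month-first (`MM/DD/YYYY`), which this particular value cannot confirm, so check the source files first; `normalise_dates` is a hypothetical helper.

```python
import pandas as pd


def normalise_dates(df, column="Date"):
    """Return a copy of df with `column` rewritten as YYYY-MM-DD strings."""
    raw = df[column].astype(str).str.strip()
    # Try the ISO format first; values that do not match become NaT
    iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
    # ASSUMPTION: slash-separated values are month-first (MM/DD/YYYY)
    slashed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
    parsed = iso.fillna(slashed)
    out = df.copy()
    # Values that match neither format end up missing rather than being guessed
    out[column] = parsed.dt.strftime("%Y-%m-%d")
    return out


# Demonstration with one value in each format
mixed = pd.DataFrame({"Date": ["1985-01-01", "01/02/1985"]})
fixed = normalise_dates(mixed)
```

Applying a step like this to the output of `clean_files` would give MySQL a single date format; rows that come out with a missing date are worth inspecting before loading.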

In [None]:
run("/Users/todddequincey/Downloads/GSOD/", 1980, 1984)

In [None]:
run("/Users/todddequincey/Downloads/GSOD/", 1975, 1979)

In [None]:
run("/Users/todddequincey/Downloads/GSOD/", 1970, 1974)

In [None]:
run("/Users/todddequincey/Downloads/GSOD/", 1960, 1969)

In [14]:
year_1985 = create_annual_file("/Users/todddequincey/Downloads/GSOD/", "1985")

In [15]:
year_1985 = clean_files(year_1985)

In [16]:
year_1985

Unnamed: 0,StationId,Date,Temp,Dew,SLP,StationPressure,Visib,WindSpeed,MaxWindSpeed,Gust,MaxTemp,MinTemp,Precip,SnowDepth,Conditions
0,57602099999,1985-01-01,42.6,39.2,1023.9,999.9,0.7,1.3,3.9,999.9,43.3,41.7,0.00,999.9,110000
1,57602099999,1985-01-02,42.6,39.1,1024.5,999.9,0.9,1.6,3.9,999.9,44.1,41.2,0.01,999.9,10000
2,57602099999,1985-01-03,42.5,40.1,1028.0,999.9,1.2,0.4,1.9,999.9,44.4,40.5,0.00,999.9,10000
3,57602099999,1985-01-04,42.8,40.9,1027.3,999.9,1.1,1.0,3.9,999.9,44.6,40.5,0.04,999.9,110000
4,57602099999,1985-01-05,43.5,39.4,1023.3,999.9,1.4,1.6,3.9,999.9,45.9,41.2,0.00,999.9,0
5,57602099999,1985-01-06,42.0,40.2,1024.3,999.9,1.0,1.0,1.9,999.9,42.6,40.6,0.05,999.9,10000
6,57602099999,1985-01-07,41.1,40.5,1027.8,999.9,1.1,4.3,7.8,999.9,42.8,39.2,0.13,999.9,10000
7,57602099999,1985-01-08,40.2,38.5,1027.0,999.9,1.7,3.2,5.8,999.9,41.5,39.0,0.05,999.9,10000
8,57602099999,1985-01-09,38.6,36.0,1030.6,999.9,2.6,4.9,7.8,999.9,39.6,38.1,0.08,999.9,10000
9,57602099999,1985-01-10,42.5,36.1,1021.8,999.9,5.8,1.9,5.8,999.9,49.5,36.9,0.00,999.9,0
