# Data Cleaning
I have to deduplicate and clean the data before I can adjust all prices for inflation. Since there are 22 different produce items, this will take some work.

In [1]:
import os
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

import sys
src_dir = '../../src/d01_data/'
sys.path.append(src_dir)

import d01_data as data_functions


In [2]:
path_to_data = '../../data/00_raw/agriculture_prices.db'
df = data_functions.download_data_from_db(path_to_data)

# Cleaning Data and Adding Features
Here I update all prices to 2019 dollars (that is, adjust them for inflation) and take an average of the retail prices.

Since this needs to be done by commodity, I am also going to make a dictionary that holds data frames by commodity, produce_dict. So for example, typing ``produce_dict['Strawberries']`` will return a dataframe concerning only strawberries.

The function below makes the dictionary discussed above.

In [3]:
produce_dict = data_functions.make_dictionary_of_dataframes(df)
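
The source of `make_dictionary_of_dataframes` is not shown in this notebook. As a rough sketch of the idea only, assuming the combined frame has a `Commodity` column, a dictionary like this could be built with `groupby`. The function name `split_by_commodity` and the toy data are hypothetical, not the actual helper.

```python
import pandas as pd

def split_by_commodity(df, column="Commodity"):
    """Return {commodity name: dataframe holding only that commodity's rows}."""
    # groupby yields (name, sub-frame) pairs; copy so each frame is independent
    return {name: group.copy() for name, group in df.groupby(column)}

# Toy data standing in for the real table
toy = pd.DataFrame({
    "Farm Price": [1.0, 2.0, 3.0],
    "Commodity": ["Strawberries", "Potatoes", "Strawberries"],
})
toy_dict = split_by_commodity(toy)
print(toy_dict["Strawberries"])
```

Copying each group keeps later in-place cleaning of one commodity from touching the combined frame.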

# Generalizing Data Cleaning for all Dataframes

I am going to write functions that automate the cleanup for each of the 22 dataframes, starting with the NaN values.

First, let's look at the share of NaN values in each dataframe. I also know, from working with the Strawberries df in the lab book, that there are zeros where there shouldn't be. I'm going to convert those (and anything below zero) to NaN so they count toward the total. So long as NaN values make up less than 10% of a dataframe, I'll likely drop the affected rows. But first, I need a count.

In [4]:
def count_na(df):
    '''
    Description: Counts the number of NaN values in a data frame. Prices of zero or less are
                 first set to NaN in place, so the dataframe passed in is modified.
    Parameters: df - The dataframe to be checked
    Returns: None. Prints out the percentage of each column that is NaN
    '''
    price_cols = ['Farm Price', 'Atlanta Retail', 'Chicago Retail', 'Los Angeles Retail', 'NYC Retail']
    # Anywhere a price is equal to or less than zero, assign it to NaN
    df[df[price_cols] <= 0] = np.nan
    # The last column holds the commodity name, used here as the label
    print(f'Percentage NaN for {df.iloc[0, -1]}: \n {round((df.isna().sum())/len(df), 3)*100}')
    print(' ')

In [5]:
for produce in list(produce_dict.keys()):
    count_na(produce_dict[produce])

Percentage NaN for Strawberries: 
 Farm Price            0.0
Atlanta Retail        0.0
Chicago Retail        6.9
Los Angeles Retail    0.3
NYC Retail            0.3
Avg Spread            0.0
Commodity             0.0
dtype: float64
 
Percentage NaN for Romaine Lettuce: 
 Farm Price            0.0
Atlanta Retail        0.0
Chicago Retail        5.7
Los Angeles Retail    0.2
NYC Retail            0.4
Avg Spread            0.0
Commodity             0.0
dtype: float64
 
Percentage NaN for Red Leaf Lettuce: 
 Farm Price            0.0
Atlanta Retail        0.0
Chicago Retail        5.1
Los Angeles Retail    0.2
NYC Retail            0.4
Avg Spread            0.0
Commodity             0.0
dtype: float64
 
Percentage NaN for Potatoes: 
 Farm Price            0.1
Atlanta Retail        0.1
Chicago Retail        6.1
Los Angeles Retail    0.1
NYC Retail            4.8
Avg Spread            0.0
Commodity             0.0
dtype: float64
 
Percentage NaN for Oranges: 
 Farm Price            0.1
Atlan

# NaN Values are good to drop. They comprise less than 10% of each column.

Why not always just drop all NaN values? I use this criterion because the more data you have, the better the models you can build with it. Often one column is missing many data points while the other columns have them. For example, in the Nectarines dataframe above, only 0.1% of the Farm Price data is NaN while 6.3% of the Chicago Retail data is NaN. When I drop all NaN values, every row containing a NaN is dropped. This means I lose at least 6.3% of the Farm Price data even though only 0.1% of it is actually missing. The same goes for every other column.

In short, you want to keep as much data as you can because the quality of the models you build depends on it. Calling `dropna()` on your data drops every row with any missing value, even if only one column is NaN. So you throw away at least the same percentage of data as your worst column, and more if the missing values in different columns fall on different rows. For example, I will be throwing away at least 6.3% of all my Nectarines data even though the other columns are missing much less than that.

A common strategy for salvaging some of this data is to replace the missing values with the median or with values you know make sense for the data. But if the loss is small, as it is here, it's much faster to just drop.
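
To make the trade-off concrete, here is a small sketch on made-up numbers (not the produce data) comparing `dropna()` with filling each column's gaps with that column's median.

```python
import numpy as np
import pandas as pd

# Toy frame: Farm Price is missing 1 of 5 values, Chicago Retail 2 of 5
toy = pd.DataFrame({
    "Farm Price":     [1.0, 1.2, np.nan, 1.1, 1.3],
    "Chicago Retail": [2.0, np.nan, 2.4, np.nan, 2.2],
})

dropped = toy.dropna()                              # removes every row holding any NaN
filled = toy.fillna(toy.median(numeric_only=True))  # each NaN becomes its column's median

print(f"rows kept by dropna: {len(dropped)} of {len(toy)}")
print(f"rows kept by fillna: {len(filled)} of {len(toy)}")
```

Because the gaps fall on different rows, `dropna()` removes three of the five rows here, which is more than either column is missing on its own.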

In [6]:
def drop_all_na(df):
    '''
    Description: Drops all rows in a dataframe that have values of NaN
    Parameters: df - the dataframe to have NaN values dropped
    Returns: a new dataframe with all rows containing NaN removed
    '''
    # dropna() returns a new dataframe; the input is left untouched
    return df.dropna()

In [7]:
for produce in produce_dict:
    # Assign directly: setdefault() would not overwrite an existing key
    produce_dict[produce] = drop_all_na(produce_dict[produce])

In [8]:
produce_dict['Nectarines'].isna().sum()

Farm Price            0
Atlanta Retail        0
Chicago Retail        0
Los Angeles Retail    0
NYC Retail            0
Avg Spread            0
Commodity             0
dtype: int64

# Deduplicating Data

In [9]:
for produce in list(produce_dict.keys()):
    print(produce_dict[produce].duplicated().sum())

6
12
23
45
52
60
132
139
145
312
315
321
328
332
336
339
342
344
346
350
352
355


Future note to self:   

Wait until you have a total count of NaN values and duplicated values to get a better idea of the amount of data you are going to lose. It's a small enough quantity here to not worry about it, but that might not always be the case.
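
One way to follow that note is to estimate the combined loss before dropping anything. The helper below is a hypothetical sketch (the name `rows_lost` and the toy data are mine). It counts the rows that `dropna()` followed by `drop_duplicates()` would remove.

```python
import numpy as np
import pandas as pd

def rows_lost(df):
    """Return (rows with any NaN, duplicate rows among the rest, fraction of rows lost)."""
    n_nan = int(df.isna().any(axis=1).sum())
    # Count duplicates only among rows that survive the NaN drop
    n_dupes = int(df.dropna().duplicated().sum())
    return n_nan, n_dupes, (n_nan + n_dupes) / len(df)

toy = pd.DataFrame({
    "Farm Price": [1.0, 1.0, np.nan, 2.0],
    "NYC Retail": [3.0, 3.0, 4.0, 5.0],
})
n_nan, n_dupes, fraction = rows_lost(toy)
print(f"NaN rows: {n_nan}, duplicate rows: {n_dupes}, fraction lost: {fraction:.0%}")
```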

In [10]:
def drop_all_dupes(df):
    '''
    Description: Drops all duplicates in a dataframe
    Parameters: dataframe to drop all duplicates
    Returns: a new, deduplicated dataframe
    '''
    # drop_duplicates() returns a new dataframe; the input is left untouched
    return df.drop_duplicates()

In [11]:
for produce in produce_dict:
    # Assign directly: setdefault() would not overwrite an existing key
    produce_dict[produce] = drop_all_dupes(produce_dict[produce])

In [12]:
produce_dict['Nectarines'].duplicated().sum()

0

In [13]:
produce_dict['Strawberries'].duplicated().sum()

0

In [14]:
for produce in list(produce_dict.keys()):
    print(produce_dict[produce].info())


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 907 entries, 2019-05-19 to 2001-02-04
Data columns (total 7 columns):
Farm Price            907 non-null float64
Atlanta Retail        907 non-null float64
Chicago Retail        907 non-null float64
Los Angeles Retail    907 non-null float64
NYC Retail            907 non-null float64
Avg Spread            907 non-null object
Commodity             907 non-null object
dtypes: float64(5), object(2)
memory usage: 56.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1840 entries, 2019-05-19 to 2001-02-04
Data columns (total 7 columns):
Farm Price            1840 non-null float64
Atlanta Retail        1840 non-null float64
Chicago Retail        1840 non-null float64
Los Angeles Retail    1840 non-null float64
NYC Retail            1840 non-null float64
Avg Spread            1840 non-null object
Commodity             1840 non-null object
dtypes: float64(5), object(2)
memory usage: 115.0+ KB
None
<class 'pandas.core.frame.DataF

# Dataframes are clean. Time to convert prices to 2019 USD.

I adjust prices for inflation by month, using the consumer price index data found here: [https://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/](https://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/)

Everything will be changed to correspond to USD in November of 2019.

In [15]:
cpi_df = pd.read_csv('../../data/00_raw/cpi.csv', index_col=0, header=1)

In [16]:
cpi_df

| Year | Jan | Feb | Mar | Apr | May | June | July | Aug | Sep | Oct | Nov | Dec | Avg | Dec-Dec | Avg-Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1913 | 9.800 | 9.800 | 9.800 | 9.800 | 9.700 | 9.800 | 9.900 | 9.900 | 10.000 | 10.000 | 10.100 | 10.000 | 9.900 | – | – |
| 1914 | 10.000 | 9.900 | 9.900 | 9.800 | 9.900 | 9.900 | 10.000 | 10.200 | 10.200 | 10.100 | 10.200 | 10.100 | 10.000 | 1 | 1 |
| 1915 | 10.100 | 10.000 | 9.900 | 10.000 | 10.100 | 10.100 | 10.100 | 10.100 | 10.100 | 10.200 | 10.300 | 10.300 | 10.100 | 2 | 1 |
| 1916 | 10.400 | 10.400 | 10.500 | 10.600 | 10.700 | 10.800 | 10.800 | 10.900 | 11.100 | 11.300 | 11.500 | 11.600 | 10.900 | 12.6 | 7.9 |
| 1917 | 11.700 | 12.000 | 12.000 | 12.600 | 12.800 | 13.000 | 12.800 | 13.000 | 13.300 | 13.500 | 13.500 | 13.700 | 12.800 | 18.1 | 17.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2015 | 233.707 | 234.722 | 236.119 | 236.599 | 237.805 | 238.638 | 238.654 | 238.316 | 237.945 | 237.838 | 237.336 | 236.525 | 237.017 | 0.7 | 0.1 |
| 2016 | 236.916 | 237.111 | 238.132 | 239.261 | 240.236 | 241.038 | 240.647 | 240.853 | 241.428 | 241.729 | 241.353 | 241.432 | 240.007 | 2.1 | 1.3 |
| 2017 | 242.839 | 243.603 | 243.801 | 244.524 | 244.733 | 244.955 | 244.786 | 245.519 | 246.819 | 246.663 | 246.669 | 246.524 | 245.120 | 2.1 | 2.1 |
| 2018 | 247.867 | 248.991 | 249.554 | 250.546 | 251.588 | 251.989 | 252.006 | 252.146 | 252.439 | 252.885 | 252.038 | 251.233 | 251.107 | 1.9 | 2.4 |


# Conversion to 2019 dollars and making an average retail column

The objective is to translate all prices in the dataframes into 2019 dollars using the most recent CPI value. I have to match the month and year from the index of each produce dataframe to the month and year in the CPI dataframe, then multiply each price by today's CPI divided by the CPI for that month and year.
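
As a worked example of that ratio, the sketch below converts a January 2018 price into November 2019 dollars. The function name `adjust_price` is hypothetical. The two CPI figures are the January 2018 value from the table above and the November 2019 value used later in this notebook.

```python
def adjust_price(price, cpi_then, cpi_target):
    """Convert a price from the month with CPI `cpi_then` into target-month dollars."""
    return price * cpi_target / cpi_then

cpi_jan_2018 = 247.867  # CPI for January 2018
cpi_nov_2019 = 257.208  # CPI for November 2019

adjusted = adjust_price(1.00, cpi_jan_2018, cpi_nov_2019)
print(round(adjusted, 2))  # one January 2018 dollar is about 1.04 November 2019 dollars
```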

The first method that comes to mind is to loop through each dataframe and apply the conversion where the month and year match. This isn't so bad for dataframes of this size, but for large data it would be inefficient. In that case I would start thinking about how to use arrays to process the data.
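
For reference, an array-based version might look something like the sketch below. It is not the method used in this notebook. It assumes a CPI table indexed by year with month columns named `'1'` to `'12'`, and the toy CPI numbers and prices are made up.

```python
import numpy as np
import pandas as pd

# Toy CPI table: index is the year, columns '1'..'12' are the months
month_cols = [str(m) for m in range(1, 13)]
cpi = pd.DataFrame({m: [100.0 + int(m), 200.0 + int(m)] for m in month_cols},
                   index=[2001, 2002])
target_cpi = 257.208  # November 2019 CPI

prices = pd.DataFrame(
    {"Farm Price": [1.0, 2.0], "Atlanta Retail": [3.0, 4.0]},
    index=pd.to_datetime(["2001-02-04", "2002-11-10"]),
)

# Look up the CPI for every row's (year, month) in one shot
cpi_values = cpi[month_cols].to_numpy()
row_pos = cpi.index.get_indexer(prices.index.year)  # position of each year in the table
col_pos = prices.index.month.to_numpy() - 1         # January is column 0
factor = target_cpi / cpi_values[row_pos, col_pos]

# Multiply every price column by its row's conversion factor
adjusted = prices.mul(factor, axis=0).round(2)
print(adjusted)
```

The lookup and the multiplication each run once over whole arrays, so there is no Python-level loop over rows.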

It would be nice to see the average retail price across the four cities listed, so the following function also appends an average city retail price and the variance of the city prices around that average.
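
A row-wise mean and sample variance can also be taken directly with pandas. The snippet below is a sketch on a single toy row, with column names assumed to match the retail columns.

```python
import pandas as pd

retail_cols = ["Atlanta Retail", "Chicago Retail", "Los Angeles Retail", "NYC Retail"]
toy = pd.DataFrame(
    [[2.24, 1.71, 2.00, 2.55]],
    columns=retail_cols,
    index=pd.to_datetime(["2019-05-19"]),
)

# axis=1 works across the city columns of each row; ddof=1 gives the sample variance
toy["Avg_Retail"] = toy[retail_cols].mean(axis=1).round(2)
toy["Avg_Retail_Var"] = toy[retail_cols].var(axis=1, ddof=1).round(2)
print(toy[["Avg_Retail", "Avg_Retail_Var"]])
```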

In [17]:
cpi_cols = ['1', '2', '3' , '4', '5', '6', '7', '8', '9', '10', '11', '12', 'Avg', 'Dec-Dec', 'Avg-Avg']

In [18]:
cpi_df.columns = cpi_cols

In [19]:
CPI_2019 = cpi_df.loc[2019, '11'] # Columns are labeled by month number, so '11' is November.

In [20]:
CPI_2019

257.20799999999997

In [21]:
def inflation_adjustment_for_df(df):
    '''
    Description: Adjusts all individual prices in a dataframe to November 2019 dollars (the month CPI_2019
                 is taken from). Also creates an average retail column and a variance column.
    Parameter: Dataframe to be adjusted
    Returns: Inflation adjusted dataframe with appended Avg_Retail and Avg_Retail_Var columns
    '''
    price_cols = ['Farm Price', 'Atlanta Retail', 'Chicago Retail', 'Los Angeles Retail', 'NYC Retail']
    adjusted = {column: [] for column in price_cols}

    for index_row in df.index:
        # CPI columns are labeled '1'..'12', so look the month up by label
        conversion = CPI_2019 / cpi_df.loc[index_row.year, str(index_row.month)]
        for column in price_cols:
            # Use the first row that matches this date
            value = df.loc[[index_row], column].iloc[0] * conversion
            adjusted[column].append(round(value, 2))

    adj_2019_dict = {
        '2019 Farm Price': adjusted['Farm Price'],
        '2019 Atlanta retail': adjusted['Atlanta Retail'],
        '2019 Chicago Retail': adjusted['Chicago Retail'],
        '2019 Los Angeles Retail': adjusted['Los Angeles Retail'],
        '2019 NYC Retail': adjusted['NYC Retail'],
        'Avg Spread': list(df['Avg Spread']),
        'Commodity': list(df['Commodity']),
    }
    df_2019_adj = pd.DataFrame(adj_2019_dict)
    df_2019_adj.index = df.index

    # Columns 1 to 4 of df are the four city retail prices.
    # Note: these statistics are computed from the input df, not from the adjusted prices.
    avg_retail = [round(np.mean(x[1:5]), 2) for x in df.values]
    avg_retail_var = [round(np.var(x[1:5], ddof=1), 2) for x in df.values]
    df_2019_adj['Avg_Retail'] = avg_retail
    df_2019_adj['Avg_Retail_Var'] = avg_retail_var

    return df_2019_adj

In [22]:
for produce in list(produce_dict.keys()):
    df = inflation_adjustment_for_df(produce_dict[produce])
    df.to_pickle(f'../../data/02_processed/{produce}.pkl')

# All Dataframes processed and ready for exploration

In [23]:
df

| Date | 2019 Farm Price | 2019 Atlanta retail | 2019 Chicago Retail | 2019 Los Angeles Retail | 2019 NYC Retail | Avg Spread | Commodity | Avg_Retail | Avg_Retail_Var |
|---|---|---|---|---|---|---|---|---|---|
| 2019-05-19 | 1.17 | 2.24 | 1.71 | 2.00 | 2.55 | 82.33% | Nectarines | 2.12 | 0.13 |
| 2019-05-12 | 0.91 | 2.68 | 1.90 | 2.48 | 2.67 | 166.21% | Nectarines | 2.42 | 0.13 |
| 2019-05-05 | 0.58 | 2.43 | 1.90 | 2.13 | 2.92 | 302.59% | Nectarines | 2.33 | 0.19 |
| 2019-04-28 | 0.58 | 2.81 | 1.96 | 2.40 | 2.98 | 334.48% | Nectarines | 2.52 | 0.20 |
| 2019-04-21 | 0.69 | 2.94 | 2.05 | 2.40 | 3.18 | 280.43% | Nectarines | 2.62 | 0.26 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2001-07-01 | 1.41 | 3.98 | 3.62 | 2.96 | 3.80 | 424.17% | Nectarines | 1.57 | 0.08 |
| 2001-06-24 | 1.44 | 3.60 | 3.42 | 3.02 | 3.61 | 293.02% | Nectarines | 1.69 | 0.01 |
| 2001-06-17 | 1.23 | 3.96 | 2.25 | 3.06 | 3.86 | 209.26% | Nectarines | 1.67 | 0.14 |
| 2001-06-10 | 1.21 | 4.32 | 3.55 | 2.88 | 3.68 | 232.33% | Nectarines | 1.93 | 0.03 |
