<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />

# Excel Madness Lab!

_Author:_ Tim Book

## Our Mission
We work for a large supermarket chain, with stores in 10 major cities that happen to coincide with General Assembly campuses. However, this company's idea of a "database" is just a bunch of Excel spreadsheets! In order to analyze our data, we're going to need to process the existing data into a form we can use. **Our end goal is to have one csv per city.**

## Cleanup Duty!
It is a hard truth that data scientists spend a large majority of their time cleaning data. Data never arrives on our desks in exactly the format in which we want it, and it's up to us to transform it to a workable format.

Being good at cleaning, moving, and reshaping data is in itself a valuable and employable job skill. If you follow these directions exactly, we will walk through constructing an automated process for processing data from this supermarket chain.

# Part I: Processing

### Step 1: Imports and the `os` library
We're going to import three libraries: numpy, pandas, and `os`.

In [1]:
# Import libraries here.
import numpy as np
import pandas as pd
import os

The `os` library is extremely useful for performing system commands from within Python. Let's get two pieces of overhead out of the way now:

1. Create an `output` folder using `os.mkdir()`
2. Create a variable called `files` that is the list of files in the `data` folder using `os.listdir()`

**WARNING:** The `os.mkdir()` function will give you an error if you try to make a folder that already exists!
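One way to avoid that error is sketched below. It is not part of the lab's required solution: `os.makedirs()` with `exist_ok=True` creates the folder only when it is missing, and a generator expression keeps only the `.xlsx` files. The temporary folder and the file names are made up so the example runs anywhere.

```python
import os
import tempfile

# Work in a throwaway folder so the sketch does not touch the real lab files.
base = tempfile.mkdtemp()
output_dir = os.path.join(base, 'output')
data_dir = os.path.join(base, 'data')

# exist_ok=True makes the call safe to repeat: no FileExistsError the second time.
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(data_dir, exist_ok=True)

# Hypothetical folder contents: two spreadsheets and one stray file.
for name in ['Jan 1.xlsx', 'Jan 2.xlsx', 'notes.txt']:
    open(os.path.join(data_dir, name), 'w').close()

# Keep only the Excel files; sorted() because os.listdir() returns arbitrary order.
files = sorted(f for f in os.listdir(data_dir) if f.endswith('.xlsx'))
print(files)
```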

In [2]:
# Create an output folder.

# Create a files variable that contains all of our data files.

In [3]:
os.mkdir('output')

FileExistsError: [Errno 17] File exists: 'output'

In [4]:

files = os.listdir('./data')
files = pd.Series(files)
# endswith() matches the literal extension; in str.contains('.xlsx') the dot is a regex wildcard.
files = files[files.str.endswith('.xlsx')]

In [5]:
files

0     Jan 26.xlsx
1     Jan 25.xlsx
3     Jan 19.xlsx
4     Jan 14.xlsx
5      Jan 6.xlsx
6      Jan 4.xlsx
7      Jan 1.xlsx
8     Jan 18.xlsx
9     Jan 28.xlsx
10     Jan 9.xlsx
11    Jan 29.xlsx
12     Jan 5.xlsx
13    Jan 31.xlsx
14    Jan 12.xlsx
15     Jan 2.xlsx
16    Jan 23.xlsx
17     Jan 7.xlsx
18    Jan 13.xlsx
19    Jan 10.xlsx
20    Jan 20.xlsx
21    Jan 27.xlsx
22    Jan 24.xlsx
23    Jan 15.xlsx
24    Jan 17.xlsx
25    Jan 16.xlsx
26     Jan 8.xlsx
27    Jan 22.xlsx
28     Jan 3.xlsx
29    Jan 21.xlsx
30    Jan 11.xlsx
31    Jan 30.xlsx
dtype: object

### Step 2: Process one data frame
It looks like we have data for the month of January. 31 files of 10 sheets each! Luckily they are all in the same format. So let's read just one in and process that. It might be helpful to open one up in your spreadsheet viewer of choice first (Excel, Numbers, Sheets, etc.)

In [6]:
# Read in data from your city from January 1st.
# pd.read_csv('./data/Jan 1.xlsx')

# The call above fails with:
# ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 2

# An .xlsx workbook is not delimited text, so pd.read_csv() cannot parse it.
# Resaving it by hand in a text editor is not needed: pd.read_excel() reads it directly (next cell).

# NOTE (as of 10/08/2020): I found a JupyterLab plugin for viewing the spreadsheets inside Jupyter.

# Source: https://github.com/quigleyj97/jupyterlab-spreadsheet
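
A quick way to see why `pd.read_csv()` cannot parse these files: an `.xlsx` workbook is a ZIP archive of XML parts, not delimited text. The sketch below builds a small stand-in archive (it is not a valid workbook, just a ZIP with a made-up member name) and checks it with the standard library's `zipfile` module.

```python
import os
import tempfile
import zipfile

# Build a stand-in archive; a real .xlsx is a ZIP package in the same sense.
path = os.path.join(tempfile.mkdtemp(), 'fake.xlsx')
with zipfile.ZipFile(path, 'w') as archive:
    archive.writestr('placeholder.xml', '<data/>')  # hypothetical member name

# ZIP archives start with the bytes b'PK', which a text parser cannot tokenize.
with open(path, 'rb') as handle:
    magic = handle.read(2)

is_zip = zipfile.is_zipfile(path)
print(is_zip, magic)
```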

In [7]:
jan1 = pd.read_excel('./data/Jan 1.xlsx', sheet_name='Atlanta')

In [8]:
jan1.mean()

prodcode     4191.534759
price_eu        1.784675
weight_kg       5.380099
quantity      300.342246
dtype: float64

### Step 2a: Convert to 'Merican columns
For whatever reason, our data are stored in euros and kilograms. Create `price_usd` and `weight_lb` columns. There are 2.2 pounds per kilogram, and 1.1 dollars per euro.
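
As a small illustration on made-up numbers (not the lab data), the conversion is a column-wise multiplication. Naming the two rates as constants, here `USD_PER_EUR` and `LB_PER_KG` (names chosen for this sketch), keeps them in one place if they ever change.

```python
import pandas as pd

# Rates given in the lab text; the constant names are only for this sketch.
USD_PER_EUR = 1.1
LB_PER_KG = 2.2

# Made-up rows with the same column names as the spreadsheets.
toy = pd.DataFrame({'price_eu': [1.0, 2.5], 'weight_kg': [2.0, 10.0]})

# Multiplying a column by a scalar converts every row at once.
toy['price_usd'] = toy['price_eu'] * USD_PER_EUR
toy['weight_lb'] = toy['weight_kg'] * LB_PER_KG
print(toy)
```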

In [9]:
jan1['price_usd'] = jan1['price_eu'] * 1.1

In [10]:
jan1.head()

|   | prodcode | price_eu | weight_kg | quantity | price_usd |
|---|---|---|---|---|---|
| 0 | 4272 | 1.168366 | 4.629459 | 155 | 1.285202 |
| 1 | 4404 | 1.996501 | 2.155133 | 325 | 2.196151 |
| 2 | 4131 | 2.021499 | 8.38148 | 418 | 2.223649 |
| 3 | 4404 | 2.980396 | 7.450484 | 177 | 3.278436 |
| 4 | 4650 | 0.977975 | 8.094924 | 384 | 1.075772 |


In [11]:
jan1['weight_lb'] = jan1['weight_kg'] * 2.2

In [12]:
jan1.head()

|   | prodcode | price_eu | weight_kg | quantity | price_usd | weight_lb |
|---|---|---|---|---|---|---|
| 0 | 4272 | 1.168366 | 4.629459 | 155 | 1.285202 | 10.18481 |
| 1 | 4404 | 1.996501 | 2.155133 | 325 | 2.196151 | 4.741293 |
| 2 | 4131 | 2.021499 | 8.38148 | 418 | 2.223649 | 18.439256 |
| 3 | 4404 | 2.980396 | 7.450484 | 177 | 3.278436 | 16.391065 |
| 4 | 4650 | 0.977975 | 8.094924 | 384 | 1.075772 | 17.808832 |


### Step 2b: Merge in product names
You'll notice we also have a `plu-codes.csv` file containing actual product names matched up against their price lookup (PLU) codes. Let's merge these product names onto our Jan 1 data.
* _Hint 1:_ What kind of merge is this? Right, left, inner, outer, etc.?
* _Hint 2:_ Pay special attention to column names!
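
The two hints can be explored on toy data first. This sketch uses invented rows and assumes the column names seen in this lab (`prodcode` in the sales sheet, `plu_code` in the lookup file). `left_on`/`right_on` let `pd.merge()` join columns whose names differ, and `validate='many_to_one'` raises an error if the lookup table has duplicate codes.

```python
import pandas as pd

# Invented sales rows: the same code can appear more than once.
sales = pd.DataFrame({'prodcode': [4272, 4404, 4404], 'quantity': [155, 325, 177]})
# Invented lookup table: one row per code.
lookup = pd.DataFrame({'product': ['Grapes', 'Peach'], 'plu_code': [4272, 4404]})

# A left merge keeps every sales row, even if a code is missing from the lookup.
merged = pd.merge(
    sales, lookup,
    how='left',
    left_on='prodcode', right_on='plu_code',
    validate='many_to_one',
)
# Both key columns survive when their names differ, so drop the redundant one.
merged = merged.drop(columns='plu_code')
print(merged)
```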

In [13]:
plu = pd.read_csv("plu-codes.csv")

In [14]:
plu[plu['plu_code'] == 4272]

|   | product | plu_code |
|---|---|---|
| 10 | Grapes | 4272 |


In [15]:
# I'm going to make the PLU codes the index of jan1, so I can easily combine the two DataFrames

In [16]:
jan1.rename(columns={'prodcode': 'plu_code'}, inplace=True)

In [17]:
jan1.head()

|   | plu_code | price_eu | weight_kg | quantity | price_usd | weight_lb |
|---|---|---|---|---|---|---|
| 0 | 4272 | 1.168366 | 4.629459 | 155 | 1.285202 | 10.18481 |
| 1 | 4404 | 1.996501 | 2.155133 | 325 | 2.196151 | 4.741293 |
| 2 | 4131 | 2.021499 | 8.38148 | 418 | 2.223649 | 18.439256 |
| 3 | 4404 | 2.980396 | 7.450484 | 177 | 3.278436 | 16.391065 |
| 4 | 4650 | 0.977975 | 8.094924 | 384 | 1.075772 | 17.808832 |


In [18]:
jan1.set_index('plu_code', inplace=True)

In [19]:
jan1.head()

| plu_code | price_eu | weight_kg | quantity | price_usd | weight_lb |
|---|---|---|---|---|---|
| 4272 | 1.168366 | 4.629459 | 155 | 1.285202 | 10.18481 |
| 4404 | 1.996501 | 2.155133 | 325 | 2.196151 | 4.741293 |
| 4131 | 2.021499 | 8.38148 | 418 | 2.223649 | 18.439256 |
| 4404 | 2.980396 | 7.450484 | 177 | 3.278436 | 16.391065 |
| 4650 | 0.977975 | 8.094924 | 384 | 1.075772 | 17.808832 |


In [20]:
# Now set the same index (plu_code) on the other DataFrame

In [21]:
plu.head()

|   | product | plu_code |
|---|---|---|
| 0 | Apple (Fuji) | 4131 |
| 1 | Apple (Gala) | 4134 |
| 2 | Apricot | 3302 |
| 3 | Avocado | 4225 |
| 4 | Banana | 4011 |


In [22]:
plu.set_index('plu_code', inplace=True)

In [23]:
plu.head()

| plu_code | product |
|---|---|
| 4131 | Apple (Fuji) |
| 4134 | Apple (Gala) |
| 3302 | Apricot |
| 4225 | Avocado |
| 4011 | Banana |


In [24]:
jan1_complete = pd.merge(jan1, plu, how='left', on='plu_code')

### Step 2c: Drop unnecessary columns
We've created some extraneous columns. Drop the old price and weight columns, as well as any redundant columns.

In [25]:
jan1_complete.head()

| plu_code | price_eu | weight_kg | quantity | price_usd | weight_lb | product |
|---|---|---|---|---|---|---|
| 4272 | 1.168366 | 4.629459 | 155 | 1.285202 | 10.18481 | Grapes |
| 4404 | 1.996501 | 2.155133 | 325 | 2.196151 | 4.741293 | Peach |
| 4131 | 2.021499 | 8.38148 | 418 | 2.223649 | 18.439256 | Apple (Fuji) |
| 4404 | 2.980396 | 7.450484 | 177 | 3.278436 | 16.391065 | Peach |
| 4650 | 0.977975 | 8.094924 | 384 | 1.075772 | 17.808832 | Mushroom |


In [26]:
jan1_complete.drop(['price_eu', 'weight_kg'], axis=1, inplace=True)

In [27]:
jan1_complete

| plu_code | quantity | price_usd | weight_lb | product |
|---|---|---|---|---|
| 4272 | 155 | 1.285202 | 10.184810 | Grapes |
| 4404 | 325 | 2.196151 | 4.741293 | Peach |
| 4131 | 418 | 2.223649 | 18.439256 | Apple (Fuji) |
| 4404 | 177 | 3.278436 | 16.391065 | Peach |
| 4650 | 384 | 1.075772 | 17.808832 | Mushroom |
| ... | ... | ... | ... | ... |
| 4062 | 223 | 3.108474 | 9.616004 | Cucumber |
| 4134 | 134 | 2.890181 | 2.359222 | Apple (Gala) |
| 4131 | 430 | 2.697951 | 21.614385 | Apple (Fuji) |
| 4011 | 288 | 2.436066 | 9.675293 | Banana |


In [28]:
jan1_complete.mean()

quantity     300.342246
price_usd      1.963143
weight_lb     11.836218
dtype: float64

### Step 2d: Add the date
Simply create a new `date` column that is the date this data was collected. For example, if this is from `Jan 1.xlsx`, this column should be full of `Jan 1`.
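
If a real date is preferred over the literal string, the label in the file name can be parsed with the standard library. This is a sketch under one assumption the file names cannot confirm: they carry no year, so the year (2020 here) has to be supplied by hand. The helper name `date_from_filename` is invented for this example.

```python
import datetime as dt
import os

def date_from_filename(path, year):
    """Turn a path such as './data/Jan 1.xlsx' into a datetime (hypothetical helper)."""
    # 'Jan 1.xlsx' -> 'Jan 1'
    label = os.path.splitext(os.path.basename(path))[0]
    # %b is the abbreviated month name, %d the day of the month.
    return dt.datetime.strptime(f'{label} {year}', '%b %d %Y')

parsed = date_from_filename('./data/Jan 1.xlsx', 2020)
print(parsed)
```

Note that `%b` follows the current locale, so English month abbreviations are assumed here.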

In [29]:
import datetime as dt

In [30]:
jan1_complete['date'] = dt.datetime(2020, 1, 1)

In [31]:
jan1_complete.head()

| plu_code | quantity | price_usd | weight_lb | product | date |
|---|---|---|---|---|---|
| 4272 | 155 | 1.285202 | 10.18481 | Grapes | 2020-01-01 |
| 4404 | 325 | 2.196151 | 4.741293 | Peach | 2020-01-01 |
| 4131 | 418 | 2.223649 | 18.439256 | Apple (Fuji) | 2020-01-01 |
| 4404 | 177 | 3.278436 | 16.391065 | Peach | 2020-01-01 |
| 4650 | 384 | 1.075772 | 17.808832 | Mushroom | 2020-01-01 |


### Step 3: Write a function that conducts all of Step 2
This function should take a **filename and a city name** and return a fully processed DataFrame. That is, the function should:
1. Read in the data from the given file and city.
1. Create USD and pound columns.
1. Merge in product names.
1. Drop unnecessary columns.
1. Add a date column.

In [32]:
def process_data(file, city):
    '''
    Process one city's sheet from one daily Excel file.

    file: relative or absolute path to the workbook, e.g. './data/Jan 1.xlsx'
    city: name of a sheet in that workbook, e.g. 'Atlanta'

    Example:

    process_data('./data/Jan 1.xlsx', 'Atlanta')
    '''
    # Load the PLU lookup table, indexed by code so it can be merged on plu_code.
    plu = pd.read_csv("plu-codes.csv")
    plu.set_index('plu_code', inplace=True)

    # Recover the date from the file name ('Jan 1.xlsx' -> 'Jan 1').
    # The file names carry no year, so 2020 is assumed, as in Step 2d.
    date_label = os.path.splitext(os.path.basename(file))[0]
    file_date = dt.datetime.strptime(f'{date_label} 2020', '%b %d %Y')

    # Read the city's sheet and index it by PLU code.
    excel = pd.read_excel(file, sheet_name=city)
    excel.rename(columns={'prodcode': 'plu_code'}, inplace=True)
    excel.set_index('plu_code', inplace=True)

    # Convert European units to American units.
    excel['price_usd'] = excel['price_eu'] * 1.1
    excel['weight_lb'] = excel['weight_kg'] * 2.2

    # Merge in product names, drop the old columns, and stamp the date.
    excel_complete = pd.merge(excel, plu, how='left', on='plu_code')
    excel_complete.drop(['price_eu', 'weight_kg'], axis=1, inplace=True)
    excel_complete['date'] = file_date
    return excel_complete

Test your function out on a new file and city!

In [33]:
jan31_completed = process_data('./data/Jan 31.xlsx', 'Atlanta')

In [34]:
jan31_completed

| plu_code | quantity | price_usd | weight_lb | product | date |
|---|---|---|---|---|---|
| 4159 | 124 | 1.008171 | 19.673902 | Onion | 2020-01-31 |
| 4404 | 235 | 1.569633 | 15.691165 | Peach | 2020-01-31 |
| 4323 | 143 | 1.387107 | 20.688245 | Strawberries | 2020-01-31 |
| 4062 | 318 | 2.251221 | 19.665940 | Cucumber | 2020-01-31 |
| 4323 | 423 | 1.934139 | 4.730840 | Strawberries | 2020-01-31 |
| ... | ... | ... | ... | ... | ... |
| 4272 | 375 | 3.033002 | 8.460521 | Grapes | 2020-01-31 |
| 4958 | 440 | 2.314251 | 4.608503 | Lemon | 2020-01-31 |
| 4078 | 444 | 3.222500 | 14.817661 | Corn | 2020-01-31 |
| 3302 | 320 | 0.741137 | 4.641995 | Apricot | 2020-01-31 |


### Step 4: Process all of January's data
For each spreadsheet, process the data and store the resulting DataFrame in one big list. **You only need to do this for your city!**

* _Hint 1:_ A listcomp would make this whole step one line of code!
* _Hint 2:_ You've already made that `files` variable to help you here.

In [35]:
jan_complete = pd.concat([process_data(f'./data/{i}', 'Atlanta') for i in files])

In [36]:
jan_complete.groupby(by='date')['date'].value_counts()

date        date      
2020-01-01  2020-01-01    187
2020-01-02  2020-01-02    160
2020-01-03  2020-01-03    157
2020-01-04  2020-01-04    101
2020-01-05  2020-01-05    186
2020-01-06  2020-01-06    108
2020-01-07  2020-01-07    112
2020-01-08  2020-01-08    148
2020-01-09  2020-01-09    165
2020-01-10  2020-01-10    137
2020-01-11  2020-01-11    171
2020-01-12  2020-01-12    187
2020-01-13  2020-01-13    146
2020-01-14  2020-01-14    186
2020-01-15  2020-01-15    143
2020-01-16  2020-01-16    191
2020-01-17  2020-01-17    126
2020-01-18  2020-01-18    142
2020-01-19  2020-01-19    112
2020-01-20  2020-01-20    188
2020-01-21  2020-01-21    120
2020-01-22  2020-01-22    133
2020-01-23  2020-01-23    192
2020-01-24  2020-01-24    174
2020-01-25  2020-01-25    175
2020-01-26  2020-01-26    126
2020-01-27  2020-01-27    148
2020-01-28  2020-01-28    199
2020-01-29  2020-01-29    107
2020-01-30  2020-01-30    197
2020-01-31  2020-01-31    140
Name: date, dtype: int64

In [37]:
jan_complete.head()

| plu_code | quantity | price_usd | weight_lb | product | date |
|---|---|---|---|---|---|
| 4412 | 288 | 0.733062 | 5.40812 | Pear | 2020-01-26 |
| 4412 | 270 | 1.65089 | 4.611842 | Pear | 2020-01-26 |
| 4159 | 435 | 3.207243 | 7.610431 | Onion | 2020-01-26 |
| 4012 | 301 | 1.943002 | 8.564789 | Orange | 2020-01-26 |
| 4412 | 342 | 3.125553 | 3.569897 | Pear | 2020-01-26 |


In [38]:
jan_complete.mean()

quantity     302.485306
price_usd      1.937980
weight_lb     11.929027
dtype: float64

### Step 5: Concatenate all DataFrames from Step 4 into one large DataFrame
* _Hint:_ Is there a function in `pandas` that can do something like this for us? This is also just one line of code!
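
For reference, here is how `pd.concat()` behaves on a list of small made-up DataFrames. By default each piece keeps its own index, so labels repeat; `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two made-up daily frames with the same columns.
day1 = pd.DataFrame({'quantity': [155, 325]})
day2 = pd.DataFrame({'quantity': [418]})

# Default: each frame keeps its own row labels, so 0 appears twice.
stacked = pd.concat([day1, day2])
# ignore_index=True replaces the labels with a fresh 0..n-1 range.
renumbered = pd.concat([day1, day2], ignore_index=True)
print(list(stacked.index), list(renumbered.index))
```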

In [39]:
# Already done: the pd.concat() call in Step 4 combined the list into one DataFrame.

In [40]:
# changing index from plu to normal integer index

In [41]:
# Source : https://thispointer.com/pandas-convert-dataframe-index-into-column-using-dataframe-reset_index-in-python/
jan_complete = jan_complete.reset_index()

In [42]:
jan_complete.head()

|   | plu_code | quantity | price_usd | weight_lb | product | date |
|---|---|---|---|---|---|---|
| 0 | 4412 | 288 | 0.733062 | 5.40812 | Pear | 2020-01-26 |
| 1 | 4412 | 270 | 1.65089 | 4.611842 | Pear | 2020-01-26 |
| 2 | 4159 | 435 | 3.207243 | 7.610431 | Onion | 2020-01-26 |
| 3 | 4012 | 301 | 1.943002 | 8.564789 | Orange | 2020-01-26 |
| 4 | 4412 | 342 | 3.125553 | 3.569897 | Pear | 2020-01-26 |


### Step 6: Do this for all cities, write data
Here's the big one. For each city, process the data as in Steps 3-5, and then write the data to our `output` folder. Below is a dictionary of city name to desired output file name.

Before writing your DataFrame, do the following:
* Add a `city` column
* Reorder the columns into the following order:


| city | date | product | plu_code | quantity | weight_lb | price_usd |
|---|---|---|---|---|---|---|

In [43]:
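# .copy() gives Atlanta its own DataFrame; plain assignment would only alias jan_complete.
jan_complete_atl = jan_complete.copy()

In [44]:
jan_complete_atl['city'] = 'Atlanta'

In [45]:
jan_complete_atl.head()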

|   | plu_code | quantity | price_usd | weight_lb | product | date | city |
|---|---|---|---|---|---|---|---|
| 0 | 4412 | 288 | 0.733062 | 5.40812 | Pear | 2020-01-26 | Atlanta |
| 1 | 4412 | 270 | 1.65089 | 4.611842 | Pear | 2020-01-26 | Atlanta |
| 2 | 4159 | 435 | 3.207243 | 7.610431 | Onion | 2020-01-26 | Atlanta |
| 3 | 4012 | 301 | 1.943002 | 8.564789 | Orange | 2020-01-26 | Atlanta |
| 4 | 4412 | 342 | 3.125553 | 3.569897 | Pear | 2020-01-26 | Atlanta |


* _Hint:_ You can reorder DataFrame columns simply by writing over your DataFrame with itself, but specifying the specific column order with `.loc`. For example:

`print(df)`

| b | c | a |
|---|---|---|
| 1 | 2 | 3 |

`df = df.loc[:, ["a", "b", "c"]]`

`print(df)`

| a | b | c |
|---|---|---|
| 3 | 1 | 2 |

In [46]:
jan_complete_atl = jan_complete_atl.loc[:, ['city', 'date', 'product', 'plu_code', 'quantity', 'weight_lb', 'price_usd']]

In [47]:
jan_complete_atl.head()

|   | city | date | product | plu_code | quantity | weight_lb | price_usd |
|---|---|---|---|---|---|---|---|
| 0 | Atlanta | 2020-01-26 | Pear | 4412 | 288 | 5.40812 | 0.733062 |
| 1 | Atlanta | 2020-01-26 | Pear | 4412 | 270 | 4.611842 | 1.65089 |
| 2 | Atlanta | 2020-01-26 | Onion | 4159 | 435 | 7.610431 | 3.207243 |
| 3 | Atlanta | 2020-01-26 | Orange | 4012 | 301 | 8.564789 | 1.943002 |
| 4 | Atlanta | 2020-01-26 | Pear | 4412 | 342 | 3.569897 | 3.125553 |


In [48]:
jan_complete_atl.mean()

plu_code     4221.003568
quantity      302.485306
weight_lb      11.929027
price_usd       1.937980
dtype: float64

In [50]:
#jan_complete_atl.to_csv(./data/)

In [51]:
city_dict = {
    "Atlanta": "atl.csv",
    "Austin": "atx.csv",
    "Boston": "bos.csv",
    "Chicago": "chi.csv",
    "Denver": "den.csv",
    "Los Angeles": "lax.csv",
    "New York": "nyc.csv",
    "San Francisco": "sf.csv",
    "Seattle": "sea.csv",
    "Washington, DC": "dc.csv"
}

In [70]:
def df_to_csv(input_df, input_dict, city):
    '''
    Finalize one city's DataFrame and write it to the output folder.

    input_df: the concatenated DataFrame for one city, indexed by plu_code
    input_dict: dict mapping city name to output file name
    city: city name, used for the new column and to look up the file name

    Moves the plu_code index back into a column, adds a city column,
    and rearranges the columns to this format:
    ['city', 'date', 'product', 'plu_code', 'quantity', 'weight_lb', 'price_usd']
    Writes the result to ./output/<file name> and returns it.
    '''
    input_df = input_df.reset_index()
    input_df['city'] = city
    input_df = input_df.loc[:, ['city', 'date', 'product', 'plu_code', 'quantity', 'weight_lb', 'price_usd']]
    # The integer index is written too; it comes back as 'Unnamed: 0' when the file is read in Part II.
    # input_dict[city] raises a KeyError for an unknown city instead of writing a file named 'None'.
    input_df.to_csv(f'./output/{input_dict[city]}')
    return input_df

In [71]:
# Loop through city_dict and carry out Step 6 here.
# The keys of city_dict can serve as the sheet name.
# The values of city_dict are what you should name the output .csv files.
# If done correctly, this cell could take almost a minute to run!

In [72]:
# My Test Cell

In [73]:
atl = df_to_csv(pd.concat([process_data(f'./data/{i}', 'Atlanta') for i in files]), city_dict, 'Atlanta')

In [74]:
atl.loc[:,['quantity', 'weight_lb', 'price_usd']].mean()

quantity     302.485306
weight_lb     11.929027
price_usd      1.937980
dtype: float64

![](imgs/correct-output.png)

In [75]:
city_list = list(city_dict)
city_list

['Atlanta',
 'Austin',
 'Boston',
 'Chicago',
 'Denver',
 'Los Angeles',
 'New York',
 'San Francisco',
 'Seattle',
 'Washington, DC']

In [76]:
def multiple_to_csv(cities):
    '''
    Process and write every city in the given list of city names.
    Returns a list with one processed DataFrame per city.
    '''
    final = []
    for city in cities:
        city_df = pd.concat([process_data(f'./data/{i}', city) for i in files])
        final.append(df_to_csv(city_df, city_dict, city))
    return final

In [77]:
final = multiple_to_csv(city_list)

In [78]:
final = pd.concat(final)

In [79]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47037 entries, 0 to 4520
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   city       47037 non-null  object        
 1   date       47037 non-null  datetime64[ns]
 2   product    47037 non-null  object        
 3   plu_code   47037 non-null  int64         
 4   quantity   47037 non-null  int64         
 5   weight_lb  47037 non-null  float64       
 6   price_usd  47037 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(2)
memory usage: 2.9+ MB


In [80]:
final['city'].value_counts()

Denver            4943
Los Angeles       4828
San Francisco     4775
Atlanta           4764
Austin            4700
Chicago           4662
New York          4639
Seattle           4615
Boston            4590
Washington, DC    4521
Name: city, dtype: int64

# Part II: Checking our answers 
In steps very similar to the ones conducted above...
1. Loop through the files we just wrote to `output`, and read them in, collecting them all in one list
1. Combine all of those DataFrames into one large DataFrame
1. For each city, find the mean `quantity`, `weight_lb`, and `price_usd`.

If you've done everything correctly, your answer should look exactly like this:

![](imgs/correct-output.png)
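
The three steps can be sketched on stand-in data. The folder, file names, and numbers below are invented so the example runs without the lab's `output` folder; `index_col=0` is an assumption that the CSVs were written with their integer index, as `DataFrame.to_csv()` does by default.

```python
import os
import tempfile
import pandas as pd

# Write two tiny stand-in CSVs to a temporary folder.
out_dir = tempfile.mkdtemp()
pd.DataFrame({'city': ['Atlanta'] * 2, 'quantity': [100, 200]}).to_csv(
    os.path.join(out_dir, 'atl.csv'))
pd.DataFrame({'city': ['Boston'] * 2, 'quantity': [300, 500]}).to_csv(
    os.path.join(out_dir, 'bos.csv'))

# 1. Read every CSV into a list (index_col=0 absorbs the saved index column).
frames = [
    pd.read_csv(os.path.join(out_dir, name), index_col=0)
    for name in sorted(os.listdir(out_dir))
    if name.endswith('.csv')
]
# 2. Combine them into one DataFrame.
combined = pd.concat(frames, ignore_index=True)
# 3. Average the numeric column for each city.
means = combined.groupby('city')['quantity'].mean()
print(means)
```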

In [91]:
def read_csv(files):
    '''
    Read each file name in `files` from the output folder
    and return them combined into one DataFrame.
    '''
    return pd.concat([pd.read_csv(f'./output/{i}') for i in files])

In [94]:
csv_files = os.listdir('./output/')
csv_files = pd.Series(csv_files)
# endswith() matches the literal extension; in str.contains('.csv') the dot is a regex wildcard.
csv_files = csv_files[csv_files.str.endswith('.csv')]

In [95]:
csv_files

1     atx.csv
2     sea.csv
3     den.csv
4     lax.csv
5      sf.csv
6     bos.csv
7     nyc.csv
8     atl.csv
9      dc.csv
10    chi.csv
dtype: object

In [96]:
answer_checking = read_csv(csv_files)

In [97]:
answer_checking.drop('Unnamed: 0', axis=1, inplace=True)

In [98]:
answer_checking_sorted = answer_checking.loc[:,['city', 'quantity', 'weight_lb', 'price_usd']]

In [99]:
answer_checking_sorted.groupby(by='city').mean()

| city | quantity | weight_lb | price_usd |
|---|---|---|---|
| Atlanta | 302.485306 | 11.929027 | 1.93798 |
| Austin | 301.048298 | 12.092775 | 1.937456 |
| Boston | 298.806536 | 12.063057 | 1.900973 |
| Chicago | 301.686186 | 12.101555 | 1.930026 |
| Denver | 301.012745 | 12.12985 | 1.932088 |
| Los Angeles | 301.531276 | 12.167527 | 1.918331 |
| New York | 299.482863 | 12.090294 | 1.912662 |
| San Francisco | 298.979895 | 12.237399 | 1.92583 |
| Seattle | 300.333694 | 11.925486 | 1.903315 |
| Washington, DC | 300.576421 | 11.930747 | 1.943107 |


![](imgs/correct-output.png)

In [100]:
# YEAH I FIXED MY BUG
# I forgot to pass the sheet argument in my first function and didn't notice it until at the very end when all of the 
# means were the exact same, representing NYC

# Part III (BONUS): Get this process production-ready!
_This part of the lab is optional, but very highly recommended, as the skills developed in this part are extremely common in industry._

For this step, we're going to take this whole process and put it into a production-ready Python script. **ABSOLUTELY NONE OF THIS STEP SHOULD TAKE PLACE IN A JUPYTER NOTEBOOK! PRODUCTIONALIZED ETL (_"Extract, Transform, Load"_) CODE DOES NOT TAKE PLACE IN A JUPYTER NOTEBOOK!**

The instructions are simple: As conducted in this lab, read, transform, and export the data in our Excel files into .csv files. The code should be in a `.py` file and executable from the command line. Here are some hints and tips to guide you:

### Hints, tips, and tricks:
* A good place to start is with the code you've already written. In this notebook, you can click `File > Download as > Python (.py)` to export as a `.py` file. Most of this exercise then comes down to you cleaning this file. (There will be a lot to clean.)
* Remember `os.mkdir()` will throw an error if the folder you're trying to make already exists. Maybe you should check to see if it already exists. If it already exists, what should you do? (Remember that `.csv` files can be overwritten with no problem.) The functions that can help you with this are all in the `os` library.
* Remember to follow all of the Python best practices we've already learned:
    - All import statements should go at the top of your script.
    - Comment your code. Comments shouldn't explain _what_ code does, but _why_ the code does this.
    - Keep your code DRY (don't repeat yourself) as opposed to WET (write everything twice). All constants should be variables that only need to be changed once. All code should be bottled into functions so you only need to fix it once.
* Make sure not to hardcode "Jan" anywhere. The point is that this code can be run throughout the lifetime of this supermarket's project, which is likely months or years. Keep your code so that if you get February data, you only need to change one tiny piece of the script (probably a file path).
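
One possible skeleton for such a script is sketched below. Everything in it is an assumption rather than the lab's official solution: the flag names, the defaults, and the function names are invented for illustration, and the actual processing is left as a comment.

```python
import argparse
import os

# Constants live in one place so a rate change is a one-line edit.
USD_PER_EUR = 1.1
LB_PER_KG = 2.2

def parse_args(argv=None):
    """Read folder locations from the command line (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description='Convert daily Excel files to one CSV per city.')
    parser.add_argument('--data-dir', default='./data',
                        help='folder holding the .xlsx files')
    parser.add_argument('--output-dir', default='./output',
                        help='folder for the .csv files')
    return parser.parse_args(argv)

def list_workbooks(data_dir):
    """Return the .xlsx paths in data_dir, whatever month they belong to."""
    return sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith('.xlsx')
    )

def main(argv=None):
    args = parse_args(argv)
    # Safe to rerun: the folder is created only if it is missing.
    os.makedirs(args.output_dir, exist_ok=True)
    workbooks = list_workbooks(args.data_dir)
    # The per-city processing and CSV writing from Part I would go here.
    return workbooks

# In the real script, finish with:
#     if __name__ == '__main__':
#         main()
```

With this layout, February's data would need only a different `--data-dir` value, for example `python etl.py --data-dir ./data_feb` (script and folder names are hypothetical).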