# Scripting Exercise: Creating an automatic report
In the course up to now, we've learned how we can use native Python techniques (functions, conditionals such as if/else statements, etc.) together with data processing libraries like Pandas to clean, transform, and analyze data.

In this section, we practice using these tools and see how they can be used to improve processes by automating tasks.

## Scripting in python
Remember that we can run python in two ways: in *scripting mode*, or in *interactive mode*.

When we run code in a notebook cell, we are running code in interactive mode. This is great when we want an immediate output, and when we want to see how our code works and check that it is working properly. 

However, when we want to create a program that we can use again and again to complete repetitive tasks, it's best to use Python in script mode.

This means we package our code up into a .py file, and run it from the command line. When we do so, our entire file will run, not just individual lines. This is very powerful as it makes executing code for repetitive tasks very easy.

### Moving code into a script
When developing a script for the first time, it can be helpful to first write and check your code in a Jupyter notebook.

Start by moving more and more of your code into one cell. Check which directory your notebook is running in with the command `!pwd`. Think about whether this is also where you would want your script to run from.


### Our scripting task
We will build on the work we did in notebook number 7. Real-world Data Practice, and also use the same liquor store dataset. 

Let's imagine you are working for the state of Iowa's Liquor Authority, and you need to make a monthly report with some key facts. Up until now, you have been generating the report by hand every month, following a list of prescribed steps. This takes you about half a day (roughly 4 hours) each month. The steps are essentially the same every time: you filter the data for the month in question, complete some data cleaning steps, and calculate the necessary KPIs. Since you do this manually, errors occasionally creep in. To keep the report as error-free as possible, you also send it to your colleague every month for some sanity checks, which takes 1 hour of your colleague's time. In total, you and your colleague spend 5 hours every month preparing the report.

#### The benefit of scripting for repetitive tasks

You talk with your boss and decide to use your Python skills to automate this task using a script. You and your boss estimate that it will take you about 10 hours to write the script, since you are still new to Python and it will take you time to debug the code. Once the script is written, it will take you 1 minute to run it each month. This means that after 2 months, the script will have 'paid for itself' in terms of time saved. You will also have sharpened your Python skills, and you and your colleague will have a combined 5 hours more each month to devote to other tasks.
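The break-even estimate can be checked with a few lines of arithmetic. This is only a sketch: the figures come from the scenario above, the 4 hours for "half a day" is an assumption, and the variable names are purely illustrative.

```python
# time spent per month on the manual process, in hours
# (your half day, assumed to be 4 hours, plus 1 hour of your colleague's time)
manual_hours_per_month = 4 + 1

# one-off cost of writing the script, and the monthly cost of running it, in hours
script_writing_hours = 10
script_run_hours_per_month = 1 / 60   # about 1 minute

# hours saved every month once the script exists
hours_saved_per_month = manual_hours_per_month - script_run_hours_per_month

# number of months until the time invested in the script is recovered
break_even_months = script_writing_hours / hours_saved_per_month
print(round(break_even_months, 2))
```

The result is just over 2, which matches the estimate that the script pays for itself after about 2 months.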

Let's get started!! 

## About the Report
Let's look more closely at the report we need to automate. 
### Goal of the Report
The goal of the report is to succinctly represent some KPIs about the overall situation of liquor sales in the state. These statistics are important as they are included in government documents, so it's also very important that they are checked and that the data underlying them is correct as well.
### KPIs:
The report shows some key facts for the given month, including:
- Store with the most revenue
- Average revenue across all stores
- Most commonly sold item across all stores
- Total number of purchases in the month across all stores 

### Format of the report
The report is given as a text file. 

## Framing the task
Remember that in script mode, we will run our script from the command line. We need to think about where we will run our script from, meaning which directory in our file system. We can check the current directory from within Jupyter by running the shell command `pwd` with a `!` in front of it. The exclamation point tells Jupyter to execute the command as a shell (bash) command on the underlying system. It helps to start writing our script in the same place where we will run it, because the location we run the script from determines which relative paths and higher-level directories we have to include when loading files and importing modules.

We also have to consider where we want our report to be saved. 

In [35]:
# seeing where we are in the file system 
!pwd

/home/jovyan/Intro%20to%20Python


The next part of framing the task is writing what's called *pseudo code*. Pseudo code is an outline of the code we want to write, expressed as easy-to-follow steps in plain English. The steps should follow the order in which our program will carry them out.

#### Pseudo Code Steps
1. Load in raw data (CSV)
2. Clean data 
3. Generate KPIs
4. Save KPIs to txt file 

Now we iterate on our pseudo code steps and add further considerations. For example, we can make the first step more specific to this task: since each report covers only one month, we should note this requirement at the start.

#### Detailed Pseudo Code Steps
1. Load in raw data (CSV)
    - Specify month & year
2. Clean data 
3. Generate KPIs
    - Store with most revenue
    - Average revenue across all stores
    - Most common item sold across all stores
    - Total number of purchases in the month across all stores
4. Sanity checks of the report
5. Save KPIs to txt file 

Now that we have our pseudo code steps, we can start coding, working through them. 

We will recycle some of the code, especially the cleaning code, that we already wrote in part 7. 

### 1. Load in raw data (CSV) 
We will start by only importing the packages that we know we need. 
This will help later when we move the code into a single script. 

In [36]:
# importing the required packages 
import pandas as pd 

# loading the csv 
data = pd.read_csv('data/liquor/iowa_liquor_sales.csv', index_col=0)

Our next step is to shorten the data to just the month and year in question. 

We will make two new variables for this, `year` and `month`. Before we can filter with them, we also need to make sure that the `Date` column is in datetime format.

Here we can add our first 'sanity check'. To do this, we will use Python's built-in [assert statements](https://www.w3schools.com/python/ref_keyword_assert.asp). Assert statements check whether our code meets certain criteria. If it doesn't, the assert statement stops the program with an error and shows a message that we specify, which is helpful for finding and fixing the problem.
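As a small illustration of how an assert statement behaves, here is a sketch with a made-up helper function, `check_month`, which is not part of the report code. The condition comes first, and the text after the comma is the message shown when the check fails.

```python
def check_month(month):
    # the condition comes first; the message after the comma is shown if the check fails
    assert 1 <= month <= 12, 'Month must be between 1 and 12'
    return month


check_month(8)          # passes silently

try:
    check_month(13)     # fails the check and raises an AssertionError
except AssertionError as error:
    message = str(error)
    print('Check failed:', message)
```

Note that Python skips assert statements entirely when it is started with the `-O` option, so asserts are best suited to sanity checks like the ones in this exercise.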

In [37]:
# setting some default values for month and year to use for testing 
year = 2015
month = 8

# converting the 'Date' column to datetime format
# (assigning to data['Date'] replaces the whole column, so its data type changes as well)
data['Date'] = pd.to_datetime(data['Date'])

# if the assert doesn't raise an error, the check passed
assert pd.api.types.is_datetime64_any_dtype(data['Date']), "Check data type of 'Date' column"

Now we can use our year and month variables to select just the relevant data. 

In [38]:
# .copy() gives us an independent DataFrame that we can safely modify in the cleaning step
month_data = data.loc[(data['Date'].dt.month==month) & (data['Date'].dt.year==year)].copy()

Next, we can add another check to make sure that our date selector works.

We'll check that all the data we have is only for this specific year and month selection. 

In [39]:
assert all(month_data.loc[:, 'Date'].dt.month == month)
assert all(month_data.loc[:, 'Date'].dt.year == year)

We will continue to work off of the `month_data` DataFrame for the rest of the script. 

### 2. Clean data
The next step, according to our pseudo code, is to clean the data to prepare for further analysis. 

We can directly copy the data cleaning commands we used in part 7. Don't forget, we have to update the name of our DataFrame to be `month_data`.

In [40]:
# removing dollar signs so that columns can be converted to numeric
# (the r prefix makes this a raw string, so the backslash is passed to the regex unchanged)
month_data = month_data.replace(r'\$', '', regex=True)
# removing commas in 1000s
month_data = month_data.replace(',','', regex=True)

# making list of columns to switch to numeric format
columns_to_numeric = ['Sale (Dollars)', 'State Bottle Cost', 'State Bottle Retail']

# converting each column in the list to numeric format
# (assigning to month_data[column] replaces the column, so its data type changes as well)
for column in columns_to_numeric:
    month_data[column] = pd.to_numeric(month_data[column])
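To see what these cleaning steps do, it can help to try them on a tiny example first. The sketch below uses a made-up table, `toy`, with values invented for illustration, and applies the same three steps.

```python
import pandas as pd

# a tiny made-up table with the same formatting problems as the raw data
toy = pd.DataFrame({'Sale (Dollars)': ['$1,234.50', '$81.00', '$9.99']})

# strip dollar signs and thousands separators, then convert to numbers
toy = toy.replace(r'\$', '', regex=True)
toy = toy.replace(',', '', regex=True)
toy['Sale (Dollars)'] = pd.to_numeric(toy['Sale (Dollars)'])

print(toy['Sale (Dollars)'].tolist())
```

After cleaning, the column holds numbers instead of text, so calculations such as `sum()` and `mean()` work as expected.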

### 3. Generate KPIs 
Now that we've prepped the data and made sure that the columns we need for doing the calculations are in numeric format, we can move on to automating the calculation of our monthly KPIs! 

As a refresher, we'll list the KPIs here:
- Store with most revenue
- Average revenue across all stores
- Most common item sold across all stores
- Total number of purchases in the month across all stores

We can work with some of the code we used in notebook number 7 for the calculations.

We will calculate each of these metrics individually and save them to their own variables, to make it easier to write to the text file at the end.

In [41]:
# KPI 1: Store with most revenue

# grouping by store and summing the 'Sale (Dollars)' column to get total revenue per store
# (selecting the column first avoids trying to sum non-numeric columns such as 'Date')
stores_data_total = month_data.groupby('Store Name')['Sale (Dollars)'].sum()

# finding store with highest total revenue
highest_revenue_store = stores_data_total.idxmax()

# looking at highest revenue store name
highest_revenue_store

'Fareway Stores #987 / Davenport'

Our next KPI involves a calculation made in general, across the entire region. 

Therefore, we don't need to break out the calculation by each store individually. 

In [45]:
# KPI 2: Average revenue across all of Iowa

# Calculating average revenue 
month_avg_revenue = month_data.loc[:, 'Sale (Dollars)'].mean()
month_avg_revenue

117.12893758300132

The next KPI is also calculated across all stores in the entire region. 

It concerns frequency: with this KPI, we want to know which item was sold most often across all stores in Iowa. Here, we need to group by item.

Generally with frequency calculations, we would then use `count()` as our aggregator. 

However, in this case, it's possible that more than one bottle of an item was sold in a single sale. This is shown in the 'Bottles Sold' column. Thus, to get an accurate count of how many bottles of each item were sold, we have to use `sum()` on this column.
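The difference between `count()` and `sum()` is easiest to see on a small example. The following sketch uses invented data (`toy_sales`, with items 'Item A' and 'Item B') in which one item is bought often in small quantities and the other is bought once in a large quantity.

```python
import pandas as pd

# made-up sales: each row is one purchase
toy_sales = pd.DataFrame({
    'Item Description': ['Item A', 'Item A', 'Item A', 'Item B'],
    'Bottles Sold': [1, 1, 1, 12],
})

grouped = toy_sales.groupby('Item Description')['Bottles Sold']

# count() counts the purchases (rows) per item
most_purchases = grouped.count().idxmax()
# sum() adds up the bottles per item
most_bottles = grouped.sum().idxmax()

print(most_purchases, most_bottles)
```

Here 'Item A' appears in the most purchases, but 'Item B' has the most bottles sold, so the two aggregators give different answers.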

In [47]:
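# KPI 3: Most common item sold

# grouping by item and summing the 'Bottles Sold' column to get total bottles per item
month_by_item = month_data.groupby('Item Description')['Bottles Sold'].sum()

# finding the item with the most bottles sold
month_highest_sold_item = month_by_item.idxmax()
month_highest_sold_item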

'Black Velvet'

Our final KPI involves simply counting the total number of purchases over the entire month.

Given that each row represents a purchase, we can just count the number of rows in our monthly dataset to get the number of purchases. 

In [69]:
month_orders = month_data.shape[0]
month_orders

1506

### 4. Sanity Checks for the report 
Because our report will be shared to other government officials and considered an official report, we need to have some automatic checks included in it. These automatic checks also prevent the need to have the report checked manually by your colleague, and contribute to the time-saving aspect of automating the report with a script. 

We have already added some of the checks earlier using assert statements in-line as the KPIs were being calculated. 

For the further sanity checks, you have decided together with your boss to compare this month's values with earlier ones. Namely, you want to check that the values are in the same range (+/- 25% difference) as the same values in the previous month, and as the same values in the same month of the previous year.

Here, we will add some further checks that prompt us to double-check a number manually if it does not fall within some pre-defined boundaries. 
To do this, we will use our `year` and `month` variables to calculate comparison values from previous years and months. 

**Note**: These ranges only work for our numeric KPIs (KPIs 2 and 4). KPIs 1 and 3 are strings.


#### Running the sanity checks
The sanity checks should be run before we run our main script to calculate and create our report, so that we can go back and fix any incorrect data if needed before saving our final report. 

Thus, we will pack these checks into a separate script to run first. 

In [55]:
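# 'last month' is December of the previous year when the current month is January
last_month, last_month_year = (12, year - 1) if month == 1 else (month - 1, year)

# selecting the comparison data for last month and for the same month last year
last_month_data = data.loc[(data['Date'].dt.month==last_month) & (data['Date'].dt.year==last_month_year)]
last_year_data = data.loc[(data['Date'].dt.month==month) & (data['Date'].dt.year==year-1)]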

**NOTE**: We also need to do the same data cleaning steps we did for this month's data.
To do this more easily and with less code repetition, we will wrap our data cleaning code into a function.

In [62]:
def clean_monthly_data(month_data):
    # removing dollar signs so that columns can be converted to numeric
    # (replace returns a new DataFrame, so the DataFrame passed in is not changed)
    month_data = month_data.replace(r'\$', '', regex=True)
    # removing commas in 1000s
    month_data = month_data.replace(',','', regex=True)

    # making list of columns to switch to numeric format
    columns_to_numeric = ['Sale (Dollars)', 'State Bottle Cost', 'State Bottle Retail']

    # converting each column in the list to numeric format
    for column in columns_to_numeric:
        month_data[column] = pd.to_numeric(month_data[column])
    
    return month_data

In [63]:
# applying data cleaning function
last_month_data = clean_monthly_data(last_month_data)
last_year_data = clean_monthly_data(last_year_data)

In [64]:
# Now, we calculate each of the KPIs the same way we did for the current month and year, and compare them. 

# Calculating average revenue 
last_month_avg_revenue = last_month_data.loc[:, 'Sale (Dollars)'].mean()
last_year_avg_revenue = last_year_data.loc[:, 'Sale (Dollars)'].mean()


# now, we will visually compare the different values
print('Current month and year average revenue:', month_avg_revenue)
print('Last month average revenue:', last_month_avg_revenue)
print('Last year average revenue:', last_year_avg_revenue)


Current month and year average revenue: 117.12893758300132
Last month average revenue: 133.20834107498342
Last year average revenue: 121.63055742108796


Looking at the numbers, it looks like they are not very far apart. 
Here, we will write our assert statement to check that the values are within our ranges. 

In [76]:
# calculating range values 

# getting difference amount to calculate range bounds
difference_amount_avg_revenue = month_avg_revenue*.25

# range bounds are current month values +/- the difference amount
lower_bound_avg_revenue = month_avg_revenue - difference_amount_avg_revenue
upper_bound_avg_revenue = month_avg_revenue + difference_amount_avg_revenue

In [77]:
assert last_month_avg_revenue <= upper_bound_avg_revenue and last_month_avg_revenue >= lower_bound_avg_revenue, 'Check values: difference to last month larger than range allows'
assert last_year_avg_revenue <= upper_bound_avg_revenue and last_year_avg_revenue >= lower_bound_avg_revenue, 'Check values: difference to last year larger than range allows'

Now we will implement a similar check for the other numeric KPI. 

Here, we check that the total number of monthly orders hasn't drastically changed (no more than 25% difference) from last month, nor from this same month last year. We will use essentially the same method as we used above in order to calculate this.

In [70]:
# calculating the total number of orders for last month and last year to compare
last_month_orders = last_month_data.shape[0]
last_year_orders = last_year_data.shape[0]
month_orders = month_data.shape[0]

1506

In [74]:
# calculating range values 

# getting difference amount to calculate range bounds
difference_amount_month_orders = month_orders*.25

# range bounds are current month values +/- the difference amount
lower_bound_month_orders = month_orders - difference_amount_month_orders
upper_bound_month_orders = month_orders + difference_amount_month_orders

In [79]:
assert last_month_orders <= upper_bound_month_orders and last_month_orders >= lower_bound_month_orders, 'Check values: difference to last month larger than range allows'
assert last_year_orders <= upper_bound_month_orders and last_year_orders >= lower_bound_month_orders, 'Check values: difference to last year larger than range allows'
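Since the same range check is repeated for each numeric KPI, it could also be wrapped into a small helper function. The sketch below is one possible way to do this; the function name `within_tolerance` and its parameters are hypothetical and not part of the original notebook. The values used are the average revenue figures printed earlier in this section.

```python
def within_tolerance(current_value, comparison_value, tolerance=0.25):
    """Return True if comparison_value lies within current_value +/- tolerance."""
    # range bounds are the current value +/- the difference amount
    difference_amount = abs(current_value) * tolerance
    lower_bound = current_value - difference_amount
    upper_bound = current_value + difference_amount
    return lower_bound <= comparison_value <= upper_bound


# average revenue values printed earlier in this section
month_avg_revenue = 117.12893758300132
last_month_avg_revenue = 133.20834107498342
last_year_avg_revenue = 121.63055742108796

assert within_tolerance(month_avg_revenue, last_month_avg_revenue), \
    'Check values: difference to last month larger than range allows'
assert within_tolerance(month_avg_revenue, last_year_avg_revenue), \
    'Check values: difference to last year larger than range allows'
```

With a helper like this, adding a check for another numeric KPI only takes one more assert line.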

### 5. Generating the final text file

Now that we are sure that our data makes sense, we can outline the final text file and write the code to save the file.

As this report will be generated every month, we want to be able to find a specific report again later. Each report should therefore be clearly identifiable by its month and year. 

For the final report, we start by listing the relevant data (month and year, day the report was generated) at the top of the file. For each line or section of the report we want to write, we will put what we want to write in a string. Then we will combine each of these strings into a list and write the entire list into the file. 

Remember that in Python we signify a line break with `\n`. This will be important when we want to separate the sections of our report. Also, remember that we can write strings that span multiple lines using triple quotes (`'''`).
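One detail worth knowing: inside a triple-quoted string, every line break we type already counts as a `\n`, so combining typed line breaks with explicit `\n` characters produces extra blank lines. The sketch below shows a header built only from typed line breaks. Using the standard library's `calendar.month_name` to print the month's name instead of its number is an optional idea, not part of the original report format, and the names it returns depend on the locale settings.

```python
import calendar

month = 8
year = 2015

# calendar.month_name maps the numbers 1-12 to month names
month_name = calendar.month_name[month]

# every line break typed inside the triple quotes becomes a \n in the string
header = '''IOWA STATE LIQUOR REPORT

Report for {} {}

'''.format(month_name, year)

print(header)
```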

In [94]:
intro_line = '''IOWA STATE LIQUOR REPORT \n \n
Report for {} {} \n \n
'''.format(month, year)

Let's look at how the intro line of our report will look once it's printed to the .txt file:

In [95]:
print(intro_line)

IOWA STATE LIQUOR REPORT 
 

Report for 8 2015 
 




Now we can add a line for each of our various KPIs. 

In [96]:
kpi1_line = 'KPI 1: \n Highest revenue store is : {} \n  \n'.format(highest_revenue_store)

In [97]:
kpi2_line = 'KPI 2: \n Average monthly revenue is: {} \n  \n'.format(month_avg_revenue)

In [98]:
kpi3_line = 'KPI 3: \n Most common item sold : {} \n  \n'.format(month_highest_sold_item)

In [99]:
kpi4_line = 'KPI 4: \n Total orders : {} \n  \n'.format(month_orders)

In [100]:
all_lines = [
    intro_line,
    kpi1_line,
    kpi2_line,
    kpi3_line,
    kpi4_line
]

#### Making a reports directory

For saving our reports, it will be helpful to have a dedicated directory in which to save the new files. 
We will first check if a directory already exists, and if it doesn't then we will create it. 

We will use the [os package](https://docs.python.org/3/library/os.html) from python to navigate the file system. 

In [101]:
import os

In [102]:
if not os.path.exists('reports'):
    os.mkdir('reports')
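As an alternative sketch, the check and the creation can be combined into a single call with `os.makedirs` and its `exist_ok` parameter. The example below runs inside a temporary directory (an assumption made only so that the example leaves no files behind); in the report script the path would simply be `'reports'`.

```python
import os
import tempfile

# working inside a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as base_dir:
    reports_dir = os.path.join(base_dir, 'reports')

    # exist_ok=True means no error is raised if the directory already exists
    os.makedirs(reports_dir, exist_ok=True)
    os.makedirs(reports_dir, exist_ok=True)   # calling it a second time is harmless

    directory_created = os.path.isdir(reports_dir)

print(directory_created)
```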

#### Writing to the file
Now that we have made our `reports` directory, we can start saving the report files there. 

To do so, we will write the lines to a .txt file. 

In [103]:
# opening the new file that we will write to 
# (the with block closes the file automatically when we are done)
with open("reports/report_{}_{}.txt".format(month, year), "w") as report_file:
    # our strings already contain \n line breaks, and writelines does not add any more
    report_file.writelines(all_lines)

## Converting into a Script
Now we've seen that our code works. It's able to save the required data that we want to have in the report. 

Our next step is to combine the code we've written so far, and move it into a .py file that can be run in script mode from the command line. 

Note that we've moved the import code for the os package to the top of the file.
This is because it's best practice to keep all imports in one place at the top of the file, so that any future programmers looking at the code know right away which packages are used. 

We will also add more comments to make the code more readable as a script. Before, what we were doing with the code was clear with the comments and text in the Jupyter notebook. We want to make sure that our code is still just as readable when it's in a script. We'll also add a longer comment explanation at the start of the script to explain to any new reader of the code what the purpose of the script is. 

We will only include the main code here, not the checks code. We will pack the checks into another script to be run separately. 

In [None]:
'''
Iowa Liquor Report Generation Script.
Written by User XX on 8 September, 2020. 

The purpose of this script is to automatically generate the monthly liquor sales 
KPIs required by the government. The data that is used is the general 
liquor sales CSV. 
The report is generated each month, for a given month and year.
A companion script called "liquor_data_checks.py" runs automatic data 
quality checks and should be run before reports are created with this script.
'''

import os
import pandas as pd 

# loading the csv containing base data
data = pd.read_csv('data/liquor/iowa_liquor_sales.csv', index_col=0)

# setting some default values for month and year to use for testing 
year = 2015
month = 8

# converting date column's datatype
# (assigning to data['Date'] replaces the whole column, so its data type changes as well)
data['Date'] = pd.to_datetime(data['Date'])

# Test that the date column's type has been correctly changed
assert pd.api.types.is_datetime64_any_dtype(data['Date']), "Check data type of 'Date' column"

# selecting a subset of the data for the desired year and month
# (.copy() gives us an independent DataFrame that we can safely modify below)
month_data = data.loc[(data['Date'].dt.month==month) & (data['Date'].dt.year==year)].copy()

# Checking that the data only contains data from the correct subset
assert all(month_data.loc[:, 'Date'].dt.month == month)
assert all(month_data.loc[:, 'Date'].dt.year == year)

# Further data cleaning needed for calculating the necessary KPIs
# removing dollar signs so that columns can be converted to numeric
month_data = month_data.replace(r'\$', '', regex=True)
# removing commas in 1000s
month_data = month_data.replace(',','', regex=True)

# making list of columns to switch to numeric format
columns_to_numeric = ['Sale (Dollars)', 'State Bottle Cost', 'State Bottle Retail']

# converting each column in the list to numeric format
for column in columns_to_numeric:
    month_data[column] = pd.to_numeric(month_data[column])

# Starting to calculate KPIs

# KPI 1: Store with most revenue
# grouping by store and summing the 'Sale (Dollars)' column to get total revenue per store
stores_data_total = month_data.groupby('Store Name')['Sale (Dollars)'].sum()
# finding store with highest total revenue
highest_revenue_store = stores_data_total.idxmax()

# KPI 2: Average revenue across all of Iowa
# Calculating average revenue 
month_avg_revenue = month_data.loc[:, 'Sale (Dollars)'].mean()


# KPI 3: Most sold item in the month
# grouping by item and summing the 'Bottles Sold' column to get total bottles per item
month_by_item = month_data.groupby('Item Description')['Bottles Sold'].sum()
# finding the item with the most bottles sold
month_highest_sold_item = month_by_item.idxmax()


# KPI 4: Total orders sold in the month
# each row in the data represents one order, so counting rows to get number of orders
month_orders = month_data.shape[0]


# Generating Report txt file

# Using month and year that were imported to initiate the file
intro_line = '''IOWA STATE LIQUOR REPORT \n \n
Report for {} {} \n \n
'''.format(month, year)

# adding each KPI and a small description about it to their own lines
kpi1_line = 'KPI 1: \n Highest revenue store is : {} \n  \n'.format(highest_revenue_store)

kpi2_line = 'KPI 2: \n Average monthly revenue is: {} \n  \n'.format(month_avg_revenue)

kpi3_line = 'KPI 3: \n Most common item sold : {} \n  \n'.format(month_highest_sold_item)

kpi4_line = 'KPI 4: \n Total orders : {} \n  \n'.format(month_orders)

# combining all lines to be included in the report in a list to be added to the report file
all_lines = [
    intro_line,
    kpi1_line,
    kpi2_line,
    kpi3_line,
    kpi4_line
]

# checking if directory to save report in exists, if not, creating it
if not os.path.exists('reports'):
    os.mkdir('reports')


# opening the new file that we will write to 
# (the with block closes the file automatically when we are done)
with open("reports/report_{}_{}.txt".format(month, year), "w") as report_file:
    # writing all lines to it
    report_file.writelines(all_lines)

### Finalizing the conversion to script mode
For our reporting script to be really effective, we want the user to be able to do everything from the command line - we don't want them to have to go into the script to edit anything. This is especially true concerning the variables in the script that need to be changed or updated every time - the year and month variables for the timeframe when we want to run the script. 

We can use another Python library called [argparse](https://docs.python.org/3/library/argparse.html) for this. Argparse lets us pass arguments or inputs to our script right from the command line.

After adding argparse to our script, the next important thing is to tell Python what should happen when the file is run as a script. We do this by adding an `if __name__ == '__main__'` block at the bottom of the script, which calls the rest of the code. For this to work, the report code from above is wrapped in a function, here called `create_monthly_report(month, year)`, so that the block can call it with the values given on the command line. You can learn more about this block [here](https://www.geeksforgeeks.org/what-does-the-if-__name__-__main__-do/) and [here](https://stackoverflow.com/questions/419163/what-does-if-name-main-do).

In [None]:
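import argparse 

# the rest of the code from the script above goes here,
# wrapped in a function called create_monthly_report(month, year)

if __name__ == '__main__':
    print('main triggered')
    # defining the command line arguments that the script accepts
    parser = argparse.ArgumentParser(description='Report generator')
    parser.add_argument('--year', action='store', type=int, required=True)
    parser.add_argument('--month', action='store', type=int, required=True)
    results = parser.parse_args()
    month = results.month
    year = results.year
    
    # generating the report for the requested month and year
    create_monthly_report(month, year)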
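To try out the argument handling without leaving the notebook, we can pass a list of arguments directly to `parse_args`. The sketch below is only an illustration: `create_monthly_report` is a placeholder that returns the report file name instead of running the full report code, and `parse_arguments` is a hypothetical helper name. Restricting `--month` to the values 1 to 12 with `choices` is an optional extra check.

```python
import argparse


def create_monthly_report(month, year):
    """Placeholder for the report code shown earlier in this section."""
    # In the real script, the loading, cleaning, KPI and file-writing code
    # would go here, using `month` and `year` instead of hard-coded values.
    return 'report_{}_{}.txt'.format(month, year)


def parse_arguments(argv=None):
    """Read --year and --month (argv=None means: use the real command line)."""
    parser = argparse.ArgumentParser(description='Report generator')
    parser.add_argument('--year', type=int, required=True)
    parser.add_argument('--month', type=int, required=True, choices=range(1, 13))
    return parser.parse_args(argv)


# simulating the command line arguments: --year 2015 --month 8
args = parse_arguments(['--year', '2015', '--month', '8'])
report_name = create_monthly_report(args.month, args.year)
print(report_name)
```

Because the arguments are converted with `type=int`, `args.year` and `args.month` arrive as numbers and can be used directly for filtering the data.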

## Running the Script

Now that we have transferred the main code to a python script, it's time to run it! 

Remember that when we run Python in script mode, we do so from the command line. 

Every time we run a Python script from the command line, we start the command with `python`, which tells the computer to run the code with Python. After the `python` command comes the path to the script file. **If we run the script from the directory where it is located, the file name alone is enough**. However, if we run it from a higher or lower level of the directory tree, we need to specify the correct path to the file. Keep this in mind, as it is the cause of many 'file doesn't exist' errors that you may come across when running Python scripts. 

Here, since our script is in the `/scripts` folder, we need to make sure we indicate this at the beginning of the path when running the script. 

After specifying the file path, we include the arguments we added above using argparse. We indicate these by writing the argument keywords `--year` and `--month`, each followed by a space and the number we want to use.

Below we see a screenshot of how the code works and how exactly the command line command should look.

### Command to run

If you are in the top level of the repository, then the command to run the script should be:
`python scripts/generate_liquor_report.py --year [YEAR] --month [MONTH]`, with `[YEAR]` and `[MONTH]` being filled in with the values you would like to use to generate your report. 

![running_script](pics/running_script.PNG)

### Further Exercises

Now that you've seen how to build a script from start to finish, you can go further and make updates to the script, and make additional scripts of your own. Some of these exercises require knowledge beyond what we've covered so far in the training. These will require some research on your part to complete. 

Test your knowledge by completing the following exercises:
1. Make a script called `liquor_data_checks.py` that runs all the data quality checks made in part 4. 
2. Add a question in the `generate_liquor_report.py` script that prompts the user to answer if they've already run the data checks script.
    - Continue the script if they answer 'yes', and stop it if they answer 'no'. 
3. Add further `print()` statements to the script to inform the user what's happening as it runs.