## Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 

# Chapter 4: Taking Advantage of Vectorization and Broadcasting (V&B) 

## Challenge 3: Deeply nested loops or shallow loopless code?

When it comes to data manipulation using Python, a good rule of thumb is: the fewer nested loops, the better. Of course, this is just a rule of thumb, and there are cases where loops are unavoidable.
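To make the rule of thumb concrete, here is a minimal sketch on made-up numbers (the `votes` array is an assumption for illustration and is not taken from the dataset). It computes the same row-wise shares twice, once with nested loops and once with broadcasting.

```python
import numpy as np

# Made-up vote counts: rows are counties, columns are parties (illustrative only).
votes = np.array([[120, 80],
                  [300, 100],
                  [50, 150]])

# Nested-loop version: compute each party's share of its county's total.
shares_loop = np.zeros(votes.shape)
for i in range(votes.shape[0]):
    row_total = 0
    for j in range(votes.shape[1]):
        row_total += votes[i, j]
    for j in range(votes.shape[1]):
        shares_loop[i, j] = votes[i, j] / row_total

# Loopless version: the (3, 1) column of row totals is broadcast across
# the (3, 2) array, so every division happens in one vectorized step.
shares_vb = votes / votes.sum(axis=1, keepdims=True)

# Both versions should agree.
print(np.allclose(shares_loop, shares_vb))
```

The two versions give the same numbers; the second one is shorter and leaves the looping to NumPy.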

In this challenge, we will work on a big data manipulation task. I will give you a solution that has lots of nested loops and your job will be to transform the solution into one that only has the necessary loops. Let’s get started. 

In this challenge, we will be using the United States presidential elections historical dataset. You may access the data from its source, the [MIT Election Data and Science Lab](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ). You may also download the file *countypres_2000-2020.csv* directly from the following link: https://packt-data-prep-workshop.s3.eu-west-1.amazonaws.com/countypres_2000-2020.csv

Answer the following questions or complete the following steps.

1.	The following code reads *countypres_2000-2020.csv* into the `election_df` pandas DataFrame. Run the following code and study the columns of `election_df`.


In [None]:
import pandas as pd
election_df = pd.read_csv('countypres_2000-2020.csv')
election_df.head()

2.	The following code counts the number of unique counties in the US from `election_df`. Run the code to find out how many counties are in the US.

In [None]:
us_counties = (
    election_df.state_po +
    ' - ' +
    election_df.county_name
).unique()
print(us_counties)
print(len(us_counties))

3.	The following code filters `election_df` to isolate the data of Yuma county in Arizona. Run the code and study its printout. Specifically, pay attention to the columns `party` and `mode`.

In [None]:
yuma_df = (election_df
 .query('state_po == "AZ"')
 .query('county_name == "YUMA"')
)
yuma_df

4.	The following code keeps only the rows where the value of `party` is either **DEMOCRAT** or **REPUBLICAN**. Study the code and its printout.

In [None]:
BM = yuma_df['party'].isin(['DEMOCRAT','REPUBLICAN'])
yuma_df = (
    yuma_df[BM]
    .copy()
)
yuma_df

5.	The following code uses the `.groupby()` function to sum the votes across all of the voting modes, so that there is a single vote count for every election and every party in each county. Run the code and study the reduced version of `yuma_df`.



In [None]:
yuma_df = (
    yuma_df
    .groupby(
        ['year','state_po','county_name','party']
    ).candidatevotes
    .sum()
    .reset_index()
)
print(yuma_df)

6.	The following code first calculates the total number of votes in Yuma county in every election year and then uses it to calculate `percentvotes` for every party in every election year. Study the code, run it, and study its printouts. 

In [None]:
total_df = yuma_df.groupby('year').candidatevotes.sum()
print(total_df)
yuma_df['percentvotes'] = None
for i, row in yuma_df.iterrows():
    yuma_df.at[i, 'percentvotes'] = (
        row.candidatevotes / total_df.loc[row.year]
    )
yuma_df
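
The loop above visits one row at a time. As a point of comparison, here is a minimal sketch of a loopless way to get the same kind of column, assuming a small made-up DataFrame (`toy_df` and its numbers are illustrative, not the Yuma data). It relies on `.transform('sum')`, which returns the group total aligned to every row.

```python
import pandas as pd

# Made-up data with the same columns as the reduced yuma_df (illustrative only).
toy_df = pd.DataFrame({
    'year': [2000, 2000, 2004, 2004],
    'party': ['DEMOCRAT', 'REPUBLICAN', 'DEMOCRAT', 'REPUBLICAN'],
    'candidatevotes': [40, 60, 30, 90],
})

# transform('sum') returns the per-year total aligned to every row,
# so the division is a single element-wise operation with no explicit loop.
year_total = toy_df.groupby('year')['candidatevotes'].transform('sum')
toy_df['percentvotes'] = toy_df['candidatevotes'] / year_total
print(toy_df)
```

A side benefit of this form is that `percentvotes` comes out as a float column rather than an object column initialized with `None`.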

7.	The following code creates two line plots that show the **REPUBLICAN** and **DEMOCRAT** vote-share trends in the past 6 US presidential elections. Run the code and study the line plots.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
rep_BM = yuma_df.party == 'REPUBLICAN'
dem_BM = yuma_df.party == 'DEMOCRAT'
rep_sr = yuma_df[rep_BM].reset_index()
dem_sr = yuma_df[dem_BM].reset_index()
x = np.arange(6)
f,axes = plt.subplots(1,2,figsize=(15,4))
rep_sr.percentvotes.plot(ax = axes[0], label='REPUBLICAN',c='gray')
axes[0].set_ylabel('Percent Vote')
axes[0].set_xticks(x)
axes[0].set_xticklabels([2000,2004,2008,2012,2016,2020])
axes[0].legend()
dem_sr.percentvotes.plot(ax = axes[1], label='DEMOCRAT',c='gray')
axes[1].set_xticks(x)
axes[1].set_xticklabels([2000,2004,2008,2012,2016,2020])
axes[1].legend()
plt.legend()
plt.show()

8.	Describe the trend that you see in the two line plots that you created in Step 7. 

**Answer**: 

9.	The following code recreates the preceding figure with some additions. The code fits a straight line to each of the two line plots and draws it on top of the plot. It also adds the regression equation of each fitted line to the plots. Study and run the following code, and then study the visualization it creates.

In [None]:
from scipy.optimize import curve_fit
def linear_function(x,a,b):
    return a*x+b
x = np.arange(6)
f,axes = plt.subplots(1,2,figsize=(15,4))
rep_sr.percentvotes.plot(ax = axes[0], label='REPUBLICAN',c='gray')
p,_ = curve_fit(linear_function,x,rep_sr.percentvotes.values)
axes[0].plot(x,linear_function(x,p[0],p[1]),label='Fitted Line')
axes[0].annotate(xy=(0,0.51),
                text = f'reg equation: y= {p[0]:.4f}x + {p[1]:.4f}')
axes[0].set_ylabel('Percent Vote')
axes[0].set_xticks(x)
axes[0].set_xticklabels([2000,2004,2008,2012,2016,2020])
axes[0].legend()
dem_sr.percentvotes.plot(ax = axes[1], label='DEMOCRAT',c='gray')
p,_ = curve_fit(linear_function,x,dem_sr.percentvotes.values)
axes[1].plot(x,linear_function(x,p[0],p[1]),label='Fitted Line')
axes[1].annotate(xy=(0,0.47),
                text = f'reg equation: y= {p[0]:.4f}x + {p[1]:.4f}')
axes[1].set_xticks(x)
axes[1].set_xticklabels([2000,2004,2008,2012,2016,2020])
axes[1].legend()
plt.legend()
plt.tight_layout()
plt.savefig('images/challenge3_8.png',dpi=500)
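
For a straight line, `curve_fit` is not the only option. As a minimal sketch, assuming made-up vote shares (the `y` values below are illustrative, not Yuma's), NumPy's `np.polyfit` with `deg=1` performs an ordinary least-squares fit and should return essentially the same slope and intercept as `curve_fit` with `linear_function`, up to numerical tolerance.

```python
import numpy as np

# Made-up vote shares for six elections (illustrative only, not Yuma's values).
x = np.arange(6)
y = np.array([0.58, 0.57, 0.56, 0.55, 0.53, 0.52])

# A degree-1 polynomial fit returns the coefficients highest power first:
# slope, then intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# With only these two parameters the whole fitted line can be redrawn.
fitted = slope * x + intercept
print(f'y = {slope:.4f}x + {intercept:.4f}')
```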

10.	Each line has two parameters: an intercept and a slope. For instance, the slope of the line that fits the Republican vote percentage in `yuma_df` is **-0.0115** and the intercept of the same line is **0.5816**. With these two parameters, you can redraw the line. Compare the slope and intercept of the two lines shown in the preceding figure. What can we learn?

**Answer**: 

11.	So far in this challenge, we have been getting to know `election_df` and the four parameters we want to extract for each US county. These four parameters are listed as follows.

- **Slope DEMOCRAT**: The slope of the line that fits the trend of DEMOCRAT `percentvotes` from 2000 to 2020.
- **Intercept DEMOCRAT**: The intercept of the line that fits the trend of DEMOCRAT `percentvotes` from 2000 to 2020.
- **Slope REPUBLICAN**: The slope of the line that fits the trend of REPUBLICAN `percentvotes` from 2000 to 2020.
- **Intercept REPUBLICAN**: The intercept of the line that fits the trend of REPUBLICAN `percentvotes` from 2000 to 2020.

In this challenge, the task is to capture these parameters for each county in the best way we can.
The following code creates `param_df`, a placeholder into which we will insert the parameter values we calculate during the data manipulation task in the next step.

In [None]:
BM = election_df.party.isin(['DEMOCRAT','REPUBLICAN'])

param_df = pd.DataFrame(
    election_df[['state_po','county_name','party']][BM].copy()
    .drop_duplicates()
    .reset_index(drop=True)

    .assign(slope=None)
    .assign(intercept=None)
    .set_index(['state_po','county_name','party'])
    .unstack()
)
print(param_df)
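
If the shape produced by `.unstack()` is unfamiliar, the following minimal sketch may help. It uses a made-up two-county table (`toy_param`, with KERN added purely for illustration) and shows how the `party` level of the index moves into the columns, and how a single cell is then addressed.

```python
import pandas as pd

# Made-up placeholder rows: one row per (state, county, party) (illustrative only).
toy_param = pd.DataFrame({
    'state_po': ['AZ', 'AZ', 'CA', 'CA'],
    'county_name': ['YUMA', 'YUMA', 'KERN', 'KERN'],
    'party': ['DEMOCRAT', 'REPUBLICAN', 'DEMOCRAT', 'REPUBLICAN'],
    'slope': [0.0, 0.0, 0.0, 0.0],
    'intercept': [0.0, 0.0, 0.0, 0.0],
})

# unstack() moves the innermost index level (party) into the columns,
# leaving one row per county and two-level column labels.
wide = toy_param.set_index(['state_po', 'county_name', 'party']).unstack()

# A single cell is addressed by a (row tuple, column tuple) pair.
wide.loc[('AZ', 'YUMA'), ('slope', 'DEMOCRAT')] = -0.01
print(wide)
```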

12.	The following code uses three levels of nested loops to fill `param_df` with the four parameters for each county in the US. Study the code, and describe how the code gets the task done.

```
parties = ['DEMOCRAT','REPUBLICAN']

for state, county in param_df.index:
    BM = election_df.state_po == state
    BM = BM & (election_df.county_name ==county)
    
    county_df = (
        election_df[BM]
        .groupby(['year','party'])
        .candidatevotes
        .sum()
        .reset_index()
    )
    
    total_df = (
        county_df
        .groupby('year')
        .candidatevotes
        .sum()
    )
    
    if len(total_df)<2 or total_df.sum()==0:
        continue
    
    for party in parties:
        BM = county_df.party == party
        party_df = county_df[BM].copy()
        
        if party_df.empty:
            continue
            
        party_df['percentvotes'] = None
        
        for i, row in party_df.iterrows():
            party_df.at[i, 'percentvotes'] = (
                row.candidatevotes / total_df.loc[row.year]
            )
            
        
        
        party_df.percentvotes = (
            party_df.percentvotes.fillna(
                party_df.percentvotes.median()
            )
        )
        
        p,_ = curve_fit(linear_function,
                        np.arange(len(party_df)),
                        party_df.percentvotes.values)
        
        param_df.loc[(state,county),('slope',party)] = p[0]
        param_df.loc[(state,county),('intercept',party)] = p[1]
```

**Answer**: 



13.	Run the code in the previous step, and time how long it takes to complete. Note that the code raises some warnings, which you can safely ignore.
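
One simple way to time a piece of code is to read a high-resolution clock before and after it. The sketch below assumes a hypothetical stand-in function, `slow_sum`, in place of the real cell; in a Jupyter notebook, putting the `%%time` cell magic on the first line of the cell is another option.

```python
import time

def slow_sum(n):
    # Hypothetical stand-in for the long-running code being timed.
    total = 0
    for k in range(n):
        total += k
    return total

start = time.perf_counter()      # high-resolution wall-clock timer
result = slow_sum(100_000)
elapsed = time.perf_counter() - start
print(f'finished in {elapsed:.4f} seconds')
```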

**Answer**: 

14.	Your challenge is to redo what the code in Step 12 does, but with as few loops as possible. Challenge yourself, and see if you can do it with no loops. Hint: in my solution with few loops (perhaps none), I ended up using the following pandas DataFrame methods: `.groupby()`, `.merge()`, `.unstack()`, and `.apply()`. You don’t have to use these; whatever you can make work is great.
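
If you get stuck, the following is a rough sketch of one possible shape for a loopless pipeline, not the reference solution. It runs on made-up data (`toy`, with two counties and three elections), swaps in `np.polyfit` for `curve_fit`, and uses a hypothetical helper named `fit_line`. Try the challenge on your own first.

```python
import numpy as np
import pandas as pd

# Made-up data: two counties, three elections, two parties (illustrative only).
toy = pd.DataFrame({
    'state_po': ['AZ'] * 6 + ['CA'] * 6,
    'county_name': ['YUMA'] * 6 + ['KERN'] * 6,
    'year': [2000, 2000, 2004, 2004, 2008, 2008] * 2,
    'party': ['DEMOCRAT', 'REPUBLICAN'] * 6,
    'candidatevotes': [40, 60, 45, 55, 50, 50,
                       70, 30, 60, 40, 50, 50],
})

county_keys = ['state_po', 'county_name']

# Vote share per row: divide by the county-year total, aligned with transform.
totals = toy.groupby(county_keys + ['year'])['candidatevotes'].transform('sum')
toy['percentvotes'] = toy['candidatevotes'] / totals

def fit_line(shares):
    # Hypothetical helper: fit a straight line to one county-party series,
    # using 0, 1, 2, ... as x, as the nested-loop code does.
    slope, intercept = np.polyfit(np.arange(len(shares)), shares.to_numpy(), deg=1)
    return pd.Series({'slope': slope, 'intercept': intercept})

# One fit per (county, party) group; unstack puts the parties into columns.
params = (
    toy.sort_values('year')
    .groupby(county_keys + ['party'])['percentvotes']
    .apply(fit_line)
    .unstack()      # moves the slope/intercept level into columns
    .unstack()      # moves the party level into columns
)
print(params)
```

This sketch ignores the edge cases that the nested-loop code guards against, such as counties with fewer than two election years or zero total votes, so a real solution would still need to handle those.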

**Answer**: 

15.	Compare the runtime performance of the solution with multiple loops and the one you created with fewer loops (ideally none). Which one finishes faster? What are your conclusions?

**Answer**: 