# Homework 2

### Due: Sun Oct. 21 @ 9pm

In this homework we'll perform a hypothesis test and clean some data before training a regression model.


## Instructions

Follow the comments in each cell below and fill in the blanks (____) to complete the code.

In [None]:
import pandas as pd
import numpy as np
from pprint import pprint
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt

# To suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

%matplotlib inline

## Part 1: Hypothesis Testing with an A/B test

Suppose we work at a large company that is developing online data science tools. Currently the tool uses interface A, but we'd like to know whether interface B might be more efficient.
To measure this, we'll look at the length of active work on a project (aka project length).
We'll perform an A/B test in which half of the projects use interface A and half use interface B.

In [None]:
# read in project lengths from '../data/project_lengths.csv'
# there should be 1000 observations for each interface
df_project = pd.read_csv('../data/project_lengths.csv')
df_project.info()

In [None]:
# calculate the difference in mean project length between interface A and B
# for consistency, subtracting A from B
# hint: this number should be negative here (could interpret as faster)
mean_A = ____
mean_B = ____
observed_mean_diff = ____
observed_mean_diff

In [None]:
# we'll perform a permutation test to see how significant this result is
# generate 10000 random permutation samples of mean difference
# hint: use np.random.permutation
rand_mean_diffs = []
n_samples = 10000
combined_times = np.concatenate([df_project.lengths_A.values, df_project.lengths_B.values])
n_A = ____ # number of observations for interface A
for i in range(n_samples):
    rand_perm = ____
    rand_mean_A = ____
    rand_mean_B = ____
    rand_mean_diffs.append(____)

In [None]:
# use seaborn to plot the distribution of mean differences
# use plt.vlines to plot a line at our observed difference in means (ymin=0,ymax=0.5)
_ = ____
_ = ____

In [None]:
# the plot should seem to indicate significance, but let's calculate a one-tailed p_value using rand_mean_diffs
p_value = ____
p_value
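
As a reference for the mechanics of a permutation test, here is a minimal sketch on synthetic data. The arrays `toy_a` and `toy_b`, their sizes, and the 2000 shuffles are all assumptions made for illustration and are not the homework data. It uses NumPy's `Generator.permutation` rather than `np.random.permutation`, which behaves similarly for this purpose.

```python
import numpy as np

# Synthetic stand-ins for two groups (hypothetical values, not the homework data)
rng = np.random.default_rng(0)
toy_a = rng.normal(loc=10.0, scale=2.0, size=200)
toy_b = rng.normal(loc=8.0, scale=2.0, size=200)

# Observed statistic: B minus A, so a negative value means B is shorter
toy_observed = toy_b.mean() - toy_a.mean()

# Under the null hypothesis the group labels are exchangeable,
# so we pool the values and repeatedly reassign them at random
toy_combined = np.concatenate([toy_a, toy_b])
n_a = len(toy_a)

toy_diffs = np.empty(2000)
for i in range(len(toy_diffs)):
    shuffled = rng.permutation(toy_combined)
    toy_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# One-tailed p-value: share of shuffled differences at least as negative as observed
toy_p_value = np.mean(toy_diffs <= toy_observed)
print(toy_observed, toy_p_value)
```

The direction of the comparison in the last step follows from the B minus A convention; with the opposite subtraction order the inequality would flip.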

In [None]:
# we can calculate the effect size of our observation
# this is the absolute value of the observed_mean_diff divided by the standard deviation of the combined_times
observed_effect_size = ____
observed_effect_size

In [None]:
# we'll use this for the next 2 steps
from statsmodels.stats.power import tt_ind_solve_power

In [None]:
# what is the power of our current experiment?
# i.e. how likely is it that we would correctly decide that B is better than A,
#   given the observed effect size, number of observations and alpha level we used above
# since these are independent samples we can use tt_ind_solve_power
# hint: the power we get should not be good
power = tt_ind_solve_power(effect_size = observed_effect_size,  # what we just calculated
                           nobs1 = n_A,         # the number of observations in A
                           alpha = 0.05,        # our alpha level
                           power = ____,        # what we're interested in
                           ratio = 1            # the ratio of number of observations of A and B
                          )
power

In [None]:
# how many observations for each of A and B would we need to get a power of 0.9
#   for our observed effect size and alpha level?
# i.e. to have a 90% chance of correctly deciding B is better than A
n_obs_A = ____
n_obs_A
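
To build intuition for how effect size, sample size, alpha and power trade off, the sketch below uses a normal approximation for two equal-size independent groups and a two-sided test. This is a swapped-in simplification, not the t-based calculation that `tt_ind_solve_power` performs, so its numbers will differ slightly. The helper names `approx_power` and `approx_n_per_group` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate two-sided power for two equal-size independent groups."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region when the effect is real
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def approx_n_per_group(effect_size, power, alpha=0.05):
    """Approximate observations needed per group to reach the requested power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


# Illustrative inputs only: a medium effect size and a target power of 0.8
example_n = approx_n_per_group(effect_size=0.5, power=0.8)
example_power = approx_power(effect_size=0.5, n_per_group=example_n)
print(example_n, round(example_power, 3))
```

Because the required sample size scales with the inverse square of the effect size, halving the effect size roughly quadruples the observations needed per group.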

## Part 2: Data Cleaning and Regression

### Data Preparation and Exploration

This data is provided by World Bank Open Data https://data.worldbank.org/, processed as in Homework 1.

We will be performing regression with respect to GDP and classification with respect to Income Group.
To do that we will first need to do a little more data prep.

In [None]:
# read in the data
df_country = pd.read_csv('../data/country_electricity_by_region.csv')

# rename columns for ease of reference
columns = ['country_code','short_name','region','income_group','access_to_electricity','gdp','population_density',
           'population_total','unemployment','region_europe','region_latin_america_and_caribbean',
           'region_middle_east_and_north_africa','region_north_america','region_south_asia',
           'region_subsaharan_africa']

df_country.columns = columns
df_country.info()

In [None]:
# create a dummy variable 'gdp_missing' to indicate where 'gdp' is null
df_country['gdp_missing'] = ____

In [None]:
# use groupby to find the number of missing gdp values by income_group
# write a lambda function to apply to the grouped data, counting the number of nulls per group
df_country.groupby('income_group').gdp.apply(lambda x: ____)

In [None]:
# fill in missing gdp values according to income_group mean
# to do this, group by income_group 
# then apply a lambda function to the gdp column that uses the fillna function, filling with the mean
# inplace is not available here, so assign back into the gdp column
df_country.gdp = ____

In [None]:
# assert that there are no longer any missing values in gdp
assert ____
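
The missing-value steps above follow a common pattern: flag the nulls, count them per group, then fill each null with a statistic from its own group. Below is a minimal sketch of that pattern on a toy frame. The frame `toy` and its columns `group` and `value` are hypothetical, and the sketch uses `groupby(...).transform` as one possible way to keep the result aligned with the original index; other approaches may work equally well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each group
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# Record where values were missing before filling, so the information is not lost
toy['value_missing'] = toy['value'].isnull().astype(int)

# Count nulls within each group
nulls_per_group = toy.groupby('group')['value'].apply(lambda x: x.isnull().sum())

# Fill each null with the mean of its own group; transform returns a Series
# with the same index as the original frame, so it can be assigned straight back
toy['value'] = toy.groupby('group')['value'].transform(lambda x: x.fillna(x.mean()))

assert toy['value'].notnull().all()
print(nulls_per_group)
print(toy)
```

Creating the indicator column before filling matters: once the nulls are replaced there is no way to tell which rows were imputed.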

In [None]:
# create 'population_density_missing' dummy variable
df_country['population_density_missing'] = ____

In [None]:
# fill in missing population_density with median, grouping by region
df_country.population_density = ____

In [None]:
# create a normalized 'gdp_zscore' column
from ____ import ____
df_country['gdp_zscore'] = ____

In [None]:
# use seaborn to create a distplot (with rugplot indicators) and a boxplot of gdp_zscore to visualize outliers
fig, ax = plt.subplots(1,2,figsize=(12,4))
_ = ____
_ = ____

In [None]:
# print the top 10 country_code and gdp_zscore sorted by gdp_zscore
____

In [None]:
# set a zscore cutoff to remove the top 4 outliers
gdp_zscore_cutoff = ____

In [None]:
# create a normalized 'population_density_zscore' column
df_country['population_density_zscore'] = ____

In [None]:
# print the top 10 country_code and population_density_zscore sorted by population_density_zscore
____

In [None]:
# set a zscore cutoff to remove the top 5 outliers
population_density_zscore_cutoff = ____

In [None]:
# drop outliers (considering both gdp_zscore and population_density_zscore)
df_country = df_country[(____) & (____)]
df_country.shape
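
For reference, the z-score and cutoff steps can be sketched on toy data as follows. The z-score here is computed by hand as (x - mean) / std with the population standard deviation (`ddof=0`); a library helper may use the same or a different convention, so that is worth checking. The frame `toy`, its columns, and the cutoff of 2.0 are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one obvious outlier (row H)
toy = pd.DataFrame({
    'code': list('ABCDEFGH'),
    'value': [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 40.0],
})

# Manual z-score: distance from the mean in units of standard deviation
toy['value_zscore'] = (toy['value'] - toy['value'].mean()) / toy['value'].std(ddof=0)

# Inspect the largest z-scores to decide where a cutoff should sit
top = toy.sort_values('value_zscore', ascending=False)[['code', 'value_zscore']].head(3)
print(top)

# Keep only rows below the (hypothetical) cutoff; with several z-score columns,
# the boolean conditions can be combined with & inside the brackets
toy_cutoff = 2.0
toy_clean = toy[toy['value_zscore'] < toy_cutoff]
print(toy_clean.shape)
```

Looking at the sorted values first makes the cutoff a deliberate choice: it should fall between the last point to drop and the first point to keep.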

### Train a Regression Model

In [None]:
# create the training set of X with features (population_density, access_to_electricity) 
# and labels y (gdp)
X = ____
y = ____

In [None]:
# import and initialize a LinearRegression model using default parameters
from ____ import ____
lr = ____

In [None]:
# train the regressor on X and y
____

In [None]:
# print out the learned intercept and coefficients
print(____)
print(____)
print(____)

In [None]:
# we can use this mask to easily index into our dataset
country_mask = (df_country.country_code == 'CAN').values

In [None]:
# how far off is our model's prediction for Canada's gdp (country_code CAN) from its actual gdp?
____
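
To make concrete what the intercept, the coefficients, and a per-row prediction error mean, here is a sketch that fits ordinary least squares directly with `np.linalg.lstsq`. This is a swapped-in technique for illustration, not the scikit-learn estimator the homework asks for, and all names and data (`toy_X`, `toy_y`, `toy_labels`) are hypothetical.

```python
import numpy as np

# Hypothetical toy data generated exactly from y = 1 + 2*x1 + 3*x2
toy_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
toy_y = 1.0 + 2.0 * toy_X[:, 0] + 3.0 * toy_X[:, 1]

# Prepend a column of ones so the first fitted parameter acts as the intercept
design = np.column_stack([np.ones(len(toy_X)), toy_X])
params, *_ = np.linalg.lstsq(design, toy_y, rcond=None)
toy_intercept, toy_coefs = params[0], params[1:]
print(toy_intercept, toy_coefs)

# A boolean mask selects a single row, in the same spirit as country_mask
toy_labels = np.array(['r0', 'r1', 'r2', 'r3', 'r4'])
toy_mask = toy_labels == 'r3'

# Prediction error for that row: predicted value minus actual value
toy_error = design[toy_mask] @ params - toy_y[toy_mask]
print(toy_error)
```

Since the toy labels are an exact linear function of the features, the fit recovers the generating parameters and the error is zero up to floating-point precision; real data will leave a nonzero residual.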

In [None]:
# create a new training set X that, in addition to population_density and access_to_electricity,
# also includes the region_* dummies
X = df_country[['population_density','access_to_electricity','region_europe','region_latin_america_and_caribbean',
           'region_middle_east_and_north_africa','region_north_america','region_south_asia',
           'region_subsaharan_africa']].values

In [None]:
# instantiate a new model and train, with fit_intercept=False
lr = ____

In [None]:
# did the prediction for CAN improve?
____