# Lab 8: Define and Solve an ML Problem of Your Choosing

In [444]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [445]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

df = pd.read_csv(WHRDataSet_filename)

df.head(10)

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454
5,Afghanistan,2013,3.5721,7.503376,0.483552,51.04298,0.577955,0.074735,0.823204,0.620585,0.273328,0.482847,-1.879709,-1.403036,1.22369,0.342569,,,0.304368
6,Afghanistan,2014,3.130896,7.484583,0.525568,51.370525,0.508514,0.118579,0.871242,0.531691,0.374861,0.409048,-1.773257,-1.312503,1.395396,0.445686,,,0.413974
7,Afghanistan,2015,3.982855,7.466215,0.528597,51.693527,0.388928,0.094686,0.880638,0.553553,0.339276,0.260557,-1.844364,-1.291594,2.160618,0.54248,,,0.596918
8,Afghanistan,2016,4.220169,7.461401,0.559072,52.016529,0.522566,0.057072,0.793246,0.564953,0.348332,0.32499,-1.917693,-1.432548,1.796219,0.425627,,,0.418629
9,Afghanistan,2017,2.661718,7.460144,0.49088,52.339527,0.427011,-0.10634,0.954393,0.496349,0.371326,0.261179,,,1.454051,0.546283,,,0.286599


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. World Happiness Report
2. I will be predicting the democratic quality of different countries based on economic and social factors. 
3. This is a supervised learning problem that involves regression. This isn't a binary or multiclass problem.
4. The features will include confidence in national government, social support, GINI index, perceptions of corruption, log GDP per capita, and freedom to make life choices.
5. This is an important problem because understanding how multiple factors relate to a country's democratic quality can guide where to act. Democratic quality reflects how much people trust their government and how fair the government is in terms of people's quality of life, as seen through social support and freedom of choice. Corruption and income inequality also contribute to democratic quality. If developed countries direct resources and practices toward developing and underdeveloped countries, those countries could improve in numerous areas, which would increase happiness as well. Finding out how strongly each of these features correlates with democratic quality would further show which areas could be improved first. I also believe these factors are strongly, or at least moderately, correlated with this label.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
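The inspection techniques listed above can be sketched on a small synthetic DataFrame. The column names `feature_a`, `feature_b`, and `label` are hypothetical placeholders and are not taken from any of the four data sets; this is only a minimal illustration of the calls involved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real data set
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=50),
    "feature_b": rng.normal(size=50),
})
toy["label"] = 2.0 * toy["feature_a"] + rng.normal(scale=0.1, size=50)
toy.loc[[3, 7], "feature_b"] = np.nan  # inject two missing values

print(toy.dtypes)        # data type of each column
print(toy.describe())    # summary statistics for each numeric column

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Pearson correlation of each feature with the label
label_corr = toy.corr()["label"].drop("label")
print(label_corr)
```

A box plot per column (for example with `sns.boxplot`) is a common next step for spotting outliers.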

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [446]:
#renaming features and labels
cols = ['country', 'year',
        'Log GDP per capita', 'Social support',
        'Confidence in national government', 
        'Freedom to make life choices', 
        'Democratic Quality', 'Perceptions of corruption', 
        'GINI index (World Bank estimate), average 2000-15']
renamed = {'Confidence in national government': 'GovConfidence', 
            'Log GDP per capita': 'LogGDP', 
            'Social support': 'Support', 
            'Freedom to make life choices': 'Freedom', 
            'Perceptions of corruption': 'Corruption', 
            'GINI index (World Bank estimate), average 2000-15': 'GINI',
            }

# Keep only the columns of interest and apply the shorter names
dfnew = df[cols].rename(renamed, axis=1)
# Restrict the data to the years 2014 through 2017
dflim = dfnew[dfnew.year.isin(range(2014,2018))].copy()
print('data types: \n', dflim.dtypes, '\n')
print(dflim.head(10), '\n')

# Count missing values per column before imputation
print('null values:\n', dflim.isna().sum(), '\n')

# Replace missing values in each numeric column with that column's mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
numeric_cols = ['LogGDP', 'Support', 'GovConfidence', 'Corruption',
                'Freedom', 'Democratic Quality', 'GINI']
for col in numeric_cols:
    dflim[col] = dflim[col].fillna(dflim[col].mean())

print('cleaned null:\n', dflim.isna().sum(), '\n')

data types: 
 country                object
year                    int64
LogGDP                float64
Support               float64
GovConfidence         float64
Freedom               float64
Democratic Quality    float64
Corruption            float64
GINI                  float64
dtype: object 

        country  year    LogGDP   Support  GovConfidence   Freedom  \
6   Afghanistan  2014  7.484583  0.525568       0.409048  0.508514   
7   Afghanistan  2015  7.466215  0.528597       0.260557  0.388928   
8   Afghanistan  2016  7.461401  0.559072       0.324990  0.522566   
9   Afghanistan  2017  7.460144  0.490880       0.261179  0.427011   
16      Albania  2014  9.278097  0.625587       0.498786  0.734648   
17      Albania  2015  9.303031  0.639356       0.506978  0.703851   
18      Albania  2016  9.337774  0.638411       0.400910  0.729819   
19      Albania  2017  9.373718  0.637698       0.457738  0.749611   
23      Algeria  2014  9.509210  0.818189            NaN       NaN   
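
The cell above fills every column, including the label, with its overall mean before any train/test split. One alternative, shown here only as a sketch on hypothetical toy data, is to drop rows whose label is missing and to learn the imputation means from the training split alone, so that no test-set information reaches the model. The column names `feature` and `label` are placeholders, and `SimpleImputer` is a substitute for the manual `fillna` calls, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: a feature with two gaps and a label with one gap
toy = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0],
    "label":   [0.1, 0.2, np.nan, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Drop rows whose label is missing instead of imputing the target
toy = toy.dropna(subset=["label"])

X = toy[["feature"]]
y = toy["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the column mean on the training rows only, then apply it to both splits
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print("mean learned from training rows:", imputer.statistics_)
```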


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I don't have a new feature list after inspecting the data. I could add more features, but I am choosing to focus on economic and social factors. I will use a linear regression model because the label is continuous and I am interested in how strongly these features relate to democratic quality across all countries. I could also use decision trees or a model trained with stochastic gradient descent if the linear regression model fits poorly, which would suggest that a more complex relationship needs to be analyzed. To improve the model, I will see which features correlate with the label and drop the features that are only weakly correlated with it. I will build a linear regression model that trains on 75% of the data and tests on the remaining 25%. I will print the weights for each feature as well as the overall RMSE and R^2 scores, and possibly test with more or fewer features based on these results. I will also graph the linear regression line.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [447]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [448]:
independent = dflim.drop('country', axis=1)
independent = independent.drop('year', axis=1)
X = independent.drop('Democratic Quality', axis=1)
y = dflim['Democratic Quality']

In [449]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123)

In [450]:
model = LinearRegression()
model.fit(X_train, y_train)
predict_prob = model.predict(X_test)

In [451]:
print('Intercept:' , model.intercept_)
print('\nWeights:')

# Pair each feature name with its learned weight
for name, weight in zip(X.columns, model.coef_):
    print(name, weight)

Intercept: -2.917371917353627

Weights:
LogGDP 0.18595627463227174
Support 0.6716277910910076
GovConfidence -0.9323264561865704
Freedom 1.6840707847185428
Corruption -0.27258654170399876
GINI -0.2434361831595194


In [452]:
print('RMSE = %.2f'
      % np.sqrt(mean_squared_error(y_test, predict_prob)))
print('R^2 = %.2f'
      % r2_score(y_test, predict_prob))

RMSE = 0.63
R^2 = 0.32
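
The two metrics printed above can be reproduced by hand, which makes their meaning easier to read. RMSE is the square root of the mean squared residual, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the label mean. The arrays below are hypothetical values chosen for illustration, not predictions from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0.5, -0.2, 1.1, -1.3, 0.0])
y_hat = np.array([0.3, 0.1, 0.8, -0.9, -0.2])

residuals = y_true - y_hat

# RMSE: typical size of a prediction error, in the units of the label
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: share of the label's variance around its mean that the model accounts for
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The manual values agree with scikit-learn's implementations
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

Read this way, an R^2 of 0.32 means the squared error on the test set is about 32% lower than it would be if every prediction were simply the mean of the test labels.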


Since the R^2 score is low (0.32), the features together explain only about a third of the variance in the Democratic Quality label on the test set. The largest positive weight is for the Freedom feature. The largest negative weight is for the GovConfidence feature, indicating that a decrease in it corresponds to an increase in predicted democratic quality. I will drop the LogGDP and Corruption features, since their weights are among the smallest in magnitude, which may indicate that they have little relationship with the label.
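
One caveat when ranking features by weight: raw linear regression weights depend on each feature's units, so a feature measured on a small scale can receive a large weight without being more important. A minimal sketch of this effect, assuming two hypothetical synthetic features on different scales and using `StandardScaler` (which this notebook does not use), is:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical features on very different scales
small_scale = rng.normal(loc=0.5, scale=0.1, size=n)  # e.g. an index between 0 and 1
large_scale = rng.normal(loc=9.0, scale=1.0, size=n)  # e.g. a log-transformed quantity
X_toy = np.column_stack([small_scale, large_scale])

# Both features move the label by about the same amount per standard deviation
y_toy = 10.0 * small_scale + 1.0 * large_scale + rng.normal(scale=0.1, size=n)

# Weights on the raw features reflect units as well as importance
raw_model = LinearRegression().fit(X_toy, y_toy)

# Weights on standardized features are comparable with each other
X_std = StandardScaler().fit_transform(X_toy)
std_model = LinearRegression().fit(X_std, y_toy)

print("raw weights:         ", raw_model.coef_)
print("standardized weights:", std_model.coef_)
```

On the raw features the first weight is roughly ten times the second, while on standardized features the two are about equal, so comparing standardized weights is a safer basis for deciding which features to drop.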

In [453]:
independent = independent.drop('LogGDP', axis=1)
independent = independent.drop('Corruption', axis=1)
independent = independent.drop('Democratic Quality', axis=1)
X = independent
print(independent.columns)
y = dflim['Democratic Quality']

Index(['Support', 'GovConfidence', 'Freedom', 'GINI'], dtype='object')


In [454]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.15, random_state=123)

In [455]:
model2 = LinearRegression()
model2.fit(X_train2, y_train2)
predict_prob2 = model2.predict(X_test2)

In [456]:
print('Intercept:' , model2.intercept_)
print('\nWeights:')

# Pair each feature name with its learned weight
for name, weight in zip(X.columns, model2.coef_):
    print(name, weight)

Intercept: -2.1860666124114054

Weights:
Support 1.5565663573270236
GovConfidence -1.09437020240332
Freedom 2.289155293070856
GINI -1.0020797875742806


In [457]:
print('RMSE = %.2f'
      % np.sqrt(mean_squared_error(y_test2, predict_prob2)))
print('R^2 = %.2f'
      % r2_score(y_test2, predict_prob2))

RMSE = 0.65
R^2 = 0.27


Dropping these features did not improve the fit: R^2 fell slightly from 0.32 to 0.27 and RMSE rose from 0.63 to 0.65, even though the remaining weights grew considerably in magnitude. Since there aren't many features left, I will try once more to reduce the feature set, and if this still doesn't improve the fit, I will test a different model. This time, I will drop the GINI, GovConfidence, and Freedom columns, leaving only Support. I also lowered the test_size of all models from the planned 25% to 15% and noticed they did slightly better.
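
Because each model here is scored on a single small test split, differences of a few hundredths in R^2 can come from the split itself. A hedged sketch of comparing feature subsets with 5-fold cross-validation instead is shown below. The data and the column names `useful`, `weak`, and `noise` are synthetic placeholders, not columns from the WHR data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200

# Hypothetical features: one strong, one weak, one unrelated to the label
toy = pd.DataFrame({
    "useful": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = 2.0 * toy["useful"] + 0.3 * toy["weak"] + rng.normal(scale=1.0, size=n)

subsets = {
    "all": ["useful", "weak", "noise"],
    "useful+weak": ["useful", "weak"],
    "useful only": ["useful"],
    "noise only": ["noise"],
}

# Mean R^2 over 5 folds for each candidate feature subset
cv_r2 = {}
for name, cols in subsets.items():
    scores = cross_val_score(LinearRegression(), toy[cols], y_toy, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {cv_r2[name]:.2f}")
```

Averaging over folds gives a steadier basis for deciding whether removing a feature actually helps.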

In [458]:
independent = independent.drop('GINI', axis=1)
independent = independent.drop('GovConfidence', axis=1)
independent = independent.drop('Freedom', axis=1)
X = independent
print(independent.columns)
y = dflim['Democratic Quality']

Index(['Support'], dtype='object')


In [459]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.15, random_state=123)

In [460]:
model3 = LinearRegression()
model3.fit(X_train3, y_train3)
predict_prob3 = model3.predict(X_test3)

In [461]:
print('Intercept:' , model3.intercept_)
print('\nWeights:')

# Pair each feature name with its learned weight
for name, weight in zip(X.columns, model3.coef_):
    print(name, weight)

Intercept: -2.447509314642797

Weights:
Support 2.894449784402975


In [462]:
print('RMSE = %.2f'
      % np.sqrt(mean_squared_error(y_test3, predict_prob3)))
print('R^2 = %.2f'
      % r2_score(y_test3, predict_prob3))

RMSE = 0.68
R^2 = 0.21


After testing with the previous columns dropped, I wondered whether also dropping the Freedom feature would reveal anything, but the fit still didn't improve: R^2 fell to 0.21 and RMSE rose to 0.68. I will use a decision tree next to capture more complex relationships between these features and the label.
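
The decision tree step described above might look like the following sketch. It uses synthetic data with a deliberately nonlinear relationship rather than the WHR features, and the `max_depth=4` setting is an assumption chosen for illustration, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical single feature with a U-shaped (nonlinear) effect on the label
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

# A straight line cannot follow the U shape
linear_r2 = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="r2"
).mean()

# A shallow tree approximates the curve with piecewise-constant steps;
# limiting the depth helps keep the tree from memorizing noise
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree_r2 = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="r2").mean()

print(f"linear regression mean R^2: {linear_r2:.2f}")
print(f"decision tree mean R^2:     {tree_r2:.2f}")
```

If a tree does not clearly beat the linear model on the real features, that would suggest the low R^2 comes from the features carrying limited information about the label rather than from the choice of model.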