# Fundamentals of Data Science - Week 5 and Week 6

###  <span style='color: green'>Scroll down to the bottom of the notebook to see your assignment</span> 
<p></p>
<span style='color: red'>Deadline: **25.10.2017 (Wednesday) at 23:55 CEST**</span>

In this notebook, the first section is going to cover the following practical aspects of data science:
+ Creating a Linear Regression model
+ Predicting with the model on unseen data and calculating the error between the predicted and original scores
+ Create a simple linear regression (with a single variable and a target) on the Diabetes dataset
+ Fit a linear model on the data and plot it
+ Create multivariate linear regression to predict house prices in Boston
+ Plot the correlation between variables and the predicted vs. original price, and calculate the mean squared error
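
As a minimal sketch of the single-variable regression listed above, the block below fits a line to one feature of the diabetes dataset bundled with scikit-learn and scores it on held-out samples. The choice of feature (column 2, the body mass index) and the size of the held-out set (the last 20 samples) are assumptions made for illustration, not necessarily what the rest of this notebook uses.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes data bundled with scikit-learn (442 samples, 10 features)
X, y = datasets.load_diabetes(return_X_y=True)

# Keep a single feature (column 2, assumed here for illustration) as a 2-D array
X_single = X[:, np.newaxis, 2]

# Hold out the last 20 samples as unseen data (an arbitrary choice)
X_train, X_test = X_single[:-20], X_single[-20:]
y_train, y_test = y[:-20], y[-20:]

# Fit ordinary least squares on the training part
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out part and compare with the original scores
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("slope:", model.coef_[0])
print("mean squared error:", mse)
print("R^2:", r2)
```

Plotting `X_test` against `y_test` as a scatter and `y_pred` as a line then shows how well the fitted line follows the unseen points.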


In [1]:
import pandas as pd

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
import seaborn as sns

The <b>mean squared error</b> has increased, which shows that a single feature is not a good predictor of housing prices.

**To-Do 1: Make a train-test split and calculate the mean squared error for the training data and the test data.**

**To-Do 2: Plot the residuals for the training and test datasets.**
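
One possible way to approach these two To-Dos is sketched below on synthetic data rather than on the housing data. The feature matrix, the coefficients in `true_coef`, the 80/20 split and the `random_state` values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a multivariate dataset: 200 samples, 3 features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical coefficients
y = X.dot(true_coef) + rng.normal(scale=0.3, size=200)

# Hold out 20% of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Mean squared error on both splits
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))

# Residuals are observed minus predicted values
res_train = y_train - model.predict(X_train)
res_test = y_test - model.predict(X_test)

print("train MSE:", mse_train, "test MSE:", mse_test)
```

The residuals can then be plotted against the predictions, for example with `plt.scatter(model.predict(X_train), res_train)`. For a well-specified linear model they scatter around zero without a visible pattern.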


**In the next section, we are going to read in a feather file and assemble the dataset in one Pandas dataframe that we can work with.**
Refer to "explore_questionnaire.pdf" in the folder for a detailed explanation of the dataset.

<img src="./w56.png"/>

To install and run feather use:

**pip install feather-format**

In [3]:
import feather
import numpy
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # moved here from pandas.tools.plotting in pandas 0.20

# Read feather frames to individual variables

anp_df = feather.read_dataframe('data_science_case/anp.feather')
face_df = feather.read_dataframe('data_science_case/face.feather')
image_df = feather.read_dataframe('data_science_case/image_data.feather')
metrics_df = feather.read_dataframe('data_science_case/image_metrics.feather')
object_labels_df = feather.read_dataframe('data_science_case/object_labels.feather')
survey_df = feather.read_dataframe('data_science_case/survey.feather')

## Investigate ANP matrix:

In [4]:
anp_df.head()

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
0,951727030670259635_143763900,hot_boys,0.017,0.176,amazement
1,951727030670259635_143763900,young_couple,0.019,0.2113,joy
2,951727030670259635_143763900,dirty_laundry,-0.263,0.0929,joy
3,951727030670259635_143763900,global_mall,-0.031,0.1304,interest
4,951728575726873168_289794729,high_boots,0.025,0.1394,amazement


For each image, we are given a 'topic/label' classification.

In [6]:
print(anp_df.emotion_label.unique())

[u'amazement' u'joy' u'interest' u'sadness' u'anger' u'terror' u'serenity'
 u'fear' u'trust' u'surprise' u'grief' u'rage' u'boredom' u'ecstasy'
 u'annoyance' u'disgust' u'pensiveness' u'acceptance' u'distraction'
 u'anticipation' u'vigilance' u'loathing' u'apprehension' u'admiration']


Each image is assigned one or more ANP labels (e.g. hot_boys, young_couple, dirty_laundry). Each ANP label in turn carries an emotion label (e.g. joy, vigilance, interest) along with a score (emotion_score) that expresses how much of that emotion the label carries, for example the proportion of "joy" in the "dirty_laundry" label.

In [7]:
anp_df.describe()

Unnamed: 0,anp_sentiment,emotion_score
count,325941.0,325941.0
mean,0.064778,0.162398
std,0.396601,0.070143
min,-2.363,0.0417
25%,-0.068,0.1148
50%,0.01,0.1462
75%,0.158,0.1949
max,2.16,0.7347


In [5]:
for anp_label in anp_df['anp_label'].unique()[:10]:
    # Use the first row found for this label to report its score and emotion
    first = anp_df[anp_df.anp_label == anp_label].iloc[0]
    print('Images labeled with the  {} tag: show {} {}'.format(anp_label, first.emotion_score, first.emotion_label))

Images labeled with the  hot_boys tag: show 0.176 amazement
Images labeled with the  young_couple tag: show 0.2113 joy
Images labeled with the  dirty_laundry tag: show 0.0929 joy
Images labeled with the  global_mall tag: show 0.1304 interest
Images labeled with the  high_boots tag: show 0.1394 amazement
Images labeled with the  funny_pets tag: show 0.1924 joy
Images labeled with the  slow_motion tag: show 0.1141 interest
Images labeled with the  funny_dog tag: show 0.2859 joy
Images labeled with the  working_group tag: show 0.1234 amazement
Images labeled with the  old_friends tag: show 0.127 joy


In [6]:
#verify that each ANP label has the same emotion labels and scores across all images
anp_df[(anp_df.emotion_label=='joy') & (anp_df.anp_label=='young_couple')][:3]

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
1,951727030670259635_143763900,young_couple,0.019,0.2113,joy
1016,956837953127354719_6734387,young_couple,0.019,0.2113,joy
1679,961014334304796262_143854846,young_couple,0.019,0.2113,joy
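
The spot check above looks at a single label and a single emotion. A broader check, sketched here on a small hand-made frame that borrows a few rows from the outputs above, counts the distinct scores per (anp_label, emotion_label) pair. The frame `toy_anp` is an assumption for illustration; on the real data `anp_df` would take its place.

```python
import pandas as pd

# Toy stand-in for anp_df, using rows shown in the outputs above
toy_anp = pd.DataFrame({
    "image_id": ["951727030670259635_143763900",
                 "956837953127354719_6734387",
                 "961014334304796262_143854846",
                 "951727030670259635_143763900"],
    "anp_label": ["young_couple", "young_couple", "young_couple", "hot_boys"],
    "anp_sentiment": [0.019, 0.019, 0.019, 0.017],
    "emotion_score": [0.2113, 0.2113, 0.2113, 0.176],
    "emotion_label": ["joy", "joy", "joy", "amazement"],
})

# Number of distinct scores for every (ANP label, emotion) pair
distinct_scores = (toy_anp
                   .groupby(["anp_label", "emotion_label"])["emotion_score"]
                   .nunique())

# The pairs are consistent if each of them has exactly one distinct score
is_consistent = bool((distinct_scores == 1).all())
print(distinct_scores)
print("consistent:", is_consistent)
```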


In [None]:
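# Note: fill_emotions_matrix is not defined in the cells shown here; define it before running this cell
anp_df.apply(fill_emotions_matrix, axis=1)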

In [8]:
# Show every 50th row among those with the lowest emotion score
anp_df[anp_df.emotion_score == anp_df.emotion_score.min()][::50]
# Apparently flat lakes, snowy plovers and gold bridges evoke the weakest emotions

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
9612,581742959299903837_21697543,flat_lake,-0.036,0.0417,interest
44932,844465605783354211_187539125,snowy_plover,0.013,0.0417,vigilance
103220,472461365156868908_25469443,snowy_plover,0.013,0.0417,anger
125497,640658853911311267_55281515,flat_lake,-0.036,0.0417,boredom
158622,888102345472019372_11520833,snowy_plover,0.013,0.0417,loathing
195058,1347305921630342109_619868570,snowy_plover,0.013,0.0417,sadness
206081,1396432770201223841_3807589911,snowy_plover,0.013,0.0417,distraction
261877,472461365156868908_25469443,snowy_plover,0.013,0.0417,amazement
278568,1016105401709793550_6734387,gold_bridge,-0.004,0.0417,fear
313835,1212399656613744758_6734387,gold_bridge,-0.004,0.0417,acceptance


In [11]:
# Merge them based on the image_id so that we have a large data frame containing all the elements

image_anp_frame = pd.merge(image_df, anp_df, how='inner', on='image_id')
im_anp_obj_frame = pd.merge(image_anp_frame, object_labels_df, how='inner', on='image_id')
im_anp_obj_face_frame = pd.merge(im_anp_obj_frame, face_df, how='inner', on='image_id')
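
Because every merge above uses `how='inner'`, an image survives only if it appears in all of the merged frames, and an image with several rows in one frame is repeated once per match. The sketch below illustrates both effects on two tiny hypothetical frames (`toy_images` and `toy_anp` hold made-up names and values, not the course data).

```python
import pandas as pd

# Hypothetical image table: three images
toy_images = pd.DataFrame({
    "image_id": ["a", "b", "c"],
    "image_height": [640, 480, 1080],
})

# Hypothetical ANP table: image "a" has two labels, "b" has one, "c" has none
toy_anp = pd.DataFrame({
    "image_id": ["a", "a", "b"],
    "anp_label": ["young_couple", "hot_boys", "funny_dog"],
})

# Inner join on image_id: "c" is dropped, "a" appears once per label
merged = pd.merge(toy_images, toy_anp, how="inner", on="image_id")
print(merged)
print("rows:", len(merged))
```

Comparing `len()` of the frame before and after each merge is a quick way to see how many rows a join adds or discards.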

In [12]:
# Calculate the correlation coefficients. Notice how the main diagonal is 1.00

correlation_matrix = im_anp_obj_face_frame.corr()
correlation_matrix

Unnamed: 0,image_height,image_width,data_memorability,user_followed_by,user_follows,user_posted_photos,anp_sentiment,emotion_score,data_amz_label_confidence,face_id,...,face_sunglasses,face_beard,face_beard_confidence,face_mustache,face_mustache_confidence,face_smile,face_smile_confidence,eyeglasses,eyeglasses_confidence,emo_confidence
image_height,1.0,0.367477,0.092194,-0.044179,-0.110193,0.05678,0.015892,0.006732,0.014332,-0.034337,...,0.000161,-0.01069,0.010262,-0.007228,0.008502,0.00013,0.008904,0.004826,0.016588,-0.000509
image_width,0.367477,1.0,-0.048363,0.032737,-0.051495,0.062664,-0.008777,-0.003462,0.000225,0.026916,...,0.00198,0.027605,-0.015589,0.031723,-0.013112,-0.018547,-0.009949,0.008953,0.003785,-0.004968
data_memorability,0.092194,-0.048363,1.0,-0.010293,-0.04638,-0.067173,0.106849,0.055681,0.032926,-0.315009,...,-0.032094,-0.063643,0.047527,-0.03717,0.044731,0.052936,0.044846,-0.033755,0.083209,-0.003185
user_followed_by,-0.044179,0.032737,-0.010293,1.0,0.300155,0.150294,-0.011532,0.002684,0.006521,-0.014647,...,-0.011087,0.043988,-0.028013,0.046854,-0.025272,-0.082719,-0.052183,-0.024712,-0.008538,-0.013778
user_follows,-0.110193,-0.051495,-0.04638,0.300155,1.0,0.040369,-0.017019,0.003237,-0.003662,0.022166,...,0.013937,0.00446,-0.005718,-0.005291,0.001897,0.00714,0.011807,0.010431,-0.003827,0.00505
user_posted_photos,0.05678,0.062664,-0.067173,0.150294,0.040369,1.0,-0.054813,-0.050928,-0.002424,0.024249,...,0.018212,0.061977,-0.010526,0.064271,-0.019818,-0.064448,-0.019717,0.083662,-0.024243,-0.005419
anp_sentiment,0.015892,-0.008777,0.106849,-0.011532,-0.017019,-0.054813,1.0,0.339011,0.005725,-0.0591,...,-0.029941,-0.031011,0.020933,-0.023893,0.01985,0.042821,0.030767,-0.042127,0.017479,0.004419
emotion_score,0.006732,-0.003462,0.055681,0.002684,0.003237,-0.050928,0.339011,1.0,0.008191,-0.020458,...,-0.007838,-0.029593,0.02192,-0.027413,0.023165,0.039717,0.029398,-0.019903,0.015404,0.006878
data_amz_label_confidence,0.014332,0.000225,0.032926,0.006521,-0.003662,-0.002424,0.005725,0.008191,1.0,-0.025877,...,-0.0024,0.000109,-0.003093,-0.00405,0.001187,0.002615,-0.001492,-0.004318,0.007987,-0.001657
face_id,-0.034337,0.026916,-0.315009,-0.014647,0.022166,0.024249,-0.0591,-0.020458,-0.025877,1.0,...,0.036708,0.013296,-0.00371,0.005873,-0.005058,-0.023133,-0.011505,0.009042,-0.027795,0.00524


**To-Do 3: Plot the correlation matrix, color-coded by how strongly each pair of attributes is correlated.**

**To-Do 4: Calculate the Spearman rank correlation for the attributes.**
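
For To-Do 4, pandas can compute Spearman's rank correlation through the `method` argument of `DataFrame.corr`. The sketch below uses a made-up frame (the column names `x` and `x_cubed` are hypothetical) to show how it differs from the default Pearson coefficient on a relationship that is monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Made-up data: x_cubed grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
toy["x_cubed"] = toy["x"] ** 3

# Pearson measures linear association, Spearman measures monotonic association
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Spearman is the Pearson correlation computed on the ranks of the values
spearman_via_ranks = toy.rank().corr(method="pearson")

print(pearson.loc["x", "x_cubed"])
print(spearman.loc["x", "x_cubed"])
```

For To-Do 3, a correlation matrix like these can be passed to a plotting function such as `seaborn.heatmap` (seaborn is already imported as `sns` above).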


### ASSIGNMENT

In this notebook we learned how to train and test a regressor on numerical data. For this assignment you are required to do the following:

1. Split the data into training and testing splits
2. Train a regressor to predict the PERMA scores on the test set using different sets of attributes (not all of them at once)
3. Analyze which features (attributes) correlate well with each other and help in fitting the curve to the data better.
4. Elaborate on the results.