# Fundamentals of Data Science - Week 5 and Week 6

###  <span style='color: green'>Scroll down to the bottom of the notebook to see your assignment</span> 
<p></p>
<span style='color: red'>Deadline: **25.10.2017 (Wednesday) at 23:55 CEST**</span>

In this notebook, the first section is going to cover the following practical aspects of data science:
+ Creating a linear regression model
+ Predicting on unseen data and calculating the error between the predicted and original scores
+ Creating a simple linear regression (a single variable and a target) on the Diabetes dataset
+ Fitting a linear model to the data and plotting it
+ Creating a multivariate linear regression to predict house prices in Boston
+ Plotting the correlation between variables and the predicted vs. original prices, and calculating mean squared errors


In [1]:
import pandas as pd

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
import seaborn as sns

The <b>mean squared error</b> has increased, which shows that a single feature is not a good predictor of housing prices.
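The effect of dropping features can be checked with a small experiment. The sketch below is only an illustration: it uses the Diabetes dataset that ships with scikit-learn as a stand-in (the housing data is not loaded in this part of the notebook), picks column 2 (body mass index) as the single feature, and compares in-sample errors. The variable names are our own.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

# The Diabetes data ships with scikit-learn, so no download is needed
X, y = datasets.load_diabetes(return_X_y=True)

# Model 1: a single feature (column 2, body mass index; an arbitrary choice)
X_single = X[:, [2]]
single_model = linear_model.LinearRegression().fit(X_single, y)
mse_single = mean_squared_error(y, single_model.predict(X_single))

# Model 2: all ten features
full_model = linear_model.LinearRegression().fit(X, y)
mse_full = mean_squared_error(y, full_model.predict(X))

print('MSE, single feature: {:.1f}'.format(mse_single))
print('MSE, all features:   {:.1f}'.format(mse_full))
```

Both errors are measured on the data the models were fitted on, so the model with all features cannot do worse here. Whether it also does better on unseen data is what a train-test split is for.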

**To-Do 1: Make a train-test split and calculate the mean squared error for training data and test data.**

**To-Do 2: Plot the residuals for training and test datasets.**
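As a starting point for both To-Dos, here is a minimal sketch on the Diabetes dataset (used as a stand-in; swap in your own features and target). The 80/20 split, `random_state=0` and all variable names are arbitrary choices of ours, not requirements of the assignment.

```python
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the rows; test_size and random_state are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = linear_model.LinearRegression().fit(X_train, y_train)
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

mse_train = mean_squared_error(y_train, pred_train)
mse_test = mean_squared_error(y_test, pred_test)
print('train MSE: {:.1f}, test MSE: {:.1f}'.format(mse_train, mse_test))

# Residual = observed value minus predicted value
res_train = y_train - pred_train
res_test = y_test - pred_test

# Residuals against predictions; a good fit scatters evenly around zero
fig, ax = plt.subplots()
ax.scatter(pred_train, res_train, alpha=0.5, label='train')
ax.scatter(pred_test, res_test, alpha=0.5, label='test')
ax.axhline(0, color='black', linewidth=1)
ax.set_xlabel('predicted value')
ax.set_ylabel('residual')
ax.legend()
```

In a notebook the figure is displayed at the end of the cell; in a plain script you would add `plt.show()`.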


**In the next section, we are going to read in a feather file and assemble the dataset in one Pandas dataframe that we can work with.**
Refer to "explore_questionnaire.pdf" in the folder for a detailed explanation of the dataset.

<img src="./w56.png"/>

To install and run feather use:

**pip install feather-format**

In [3]:
import feather
import numpy
import pandas as pd
import matplotlib.pyplot as plt
# pandas.tools.plotting was removed from pandas; scatter_matrix now lives in pandas.plotting
from pandas.plotting import scatter_matrix

# Read feather frames to individual variables

anp_df = feather.read_dataframe('data_science_case/anp.feather')
face_df = feather.read_dataframe('data_science_case/face.feather')
image_df = feather.read_dataframe('data_science_case/image_data.feather')
metrics_df = feather.read_dataframe('data_science_case/image_metrics.feather')
object_labels_df = feather.read_dataframe('data_science_case/object_labels.feather')
survey_df = feather.read_dataframe('data_science_case/survey.feather')
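The six reads above follow the same pattern, so they can also be written as a loop. The sketch below is one possible way to do that, assuming the folder layout from the cell above. It uses `pd.read_feather`, the pandas reader for the same file format (it needs `pyarrow` installed), and skips files that are not found. `FRAME_FILES`, `feather_paths` and `load_frames` are hypothetical helper names.

```python
import os
import pandas as pd

DATA_DIR = 'data_science_case'  # folder name taken from the cell above

# Variable name -> file name, as used in the cell above
FRAME_FILES = {
    'anp_df': 'anp.feather',
    'face_df': 'face.feather',
    'image_df': 'image_data.feather',
    'metrics_df': 'image_metrics.feather',
    'object_labels_df': 'object_labels.feather',
    'survey_df': 'survey.feather',
}

def feather_paths(data_dir=DATA_DIR):
    """Map each frame name to the path of its feather file."""
    return {name: os.path.join(data_dir, fname)
            for name, fname in FRAME_FILES.items()}

def load_frames(data_dir=DATA_DIR):
    """Read every feather file that exists into a dict of DataFrames."""
    frames = {}
    for name, path in feather_paths(data_dir).items():
        if os.path.exists(path):
            frames[name] = pd.read_feather(path)  # requires pyarrow
        else:
            print('missing file, skipped: {}'.format(path))
    return frames

frames = load_frames()
print(sorted(frames))
```

With all files in place, `frames['anp_df']` holds the same data as `anp_df` above.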

## Investigate ANP matrix:

In [4]:
anp_df.head()

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
0,951727030670259635_143763900,hot_boys,0.017,0.176,amazement
1,951727030670259635_143763900,young_couple,0.019,0.2113,joy
2,951727030670259635_143763900,dirty_laundry,-0.263,0.0929,joy
3,951727030670259635_143763900,global_mall,-0.031,0.1304,interest
4,951728575726873168_289794729,high_boots,0.025,0.1394,amazement


For each image, we are given a certain 'topic/label' classification

In [5]:
print(anp_df.emotion_label.unique())

[u'amazement' u'joy' u'interest' u'sadness' u'anger' u'terror' u'serenity'
 u'fear' u'trust' u'surprise' u'grief' u'rage' u'boredom' u'ecstasy'
 u'annoyance' u'disgust' u'pensiveness' u'acceptance' u'distraction'
 u'anticipation' u'vigilance' u'loathing' u'apprehension' u'admiration']


Each image is assigned a label (e.g. hot_boys, young_couple, dirty_laundry). Each label is paired with an emotion label (e.g. joy, vigilance, interest) and a score (emotion_score), which indicates how strongly that emotion is present in the label, e.g. the share of "joy" in the "dirty_laundry" label.

In [6]:
anp_df.describe()

Unnamed: 0,anp_sentiment,emotion_score
count,325941.0,325941.0
mean,0.064778,0.162398
std,0.396601,0.070143
min,-2.363,0.0417
25%,-0.068,0.1148
50%,0.01,0.1462
75%,0.158,0.1949
max,2.16,0.7347


<img src="./24_emotions_of_Plutchik.png"/>

Each image is assigned a label (e.g. hot_boys, young_couple, dirty_laundry), and each label carries an <b>emotion label</b> out of the 24 emotions of Plutchik.

Each Plutchik emotion label (e.g. joy, vigilance, interest) comes with a score (emotion_score), which corresponds to the share of that emotion in the label, e.g. the share of "joy" in the "dirty_laundry" label.

In [7]:
for anp_label in anp_df['anp_label'].unique()[:10]:
    # first row for this label; a label may also appear with other emotion labels
    first = anp_df[anp_df.anp_label == anp_label].iloc[0]
    print('Images labeled with the  {} tag: show {} {}'.format(anp_label, first.emotion_score, first.emotion_label))

Images labeled with the  hot_boys tag: show 0.176 amazement
Images labeled with the  young_couple tag: show 0.2113 joy
Images labeled with the  dirty_laundry tag: show 0.0929 joy
Images labeled with the  global_mall tag: show 0.1304 interest
Images labeled with the  high_boots tag: show 0.1394 amazement
Images labeled with the  funny_pets tag: show 0.1924 joy
Images labeled with the  slow_motion tag: show 0.1141 interest
Images labeled with the  funny_dog tag: show 0.2859 joy
Images labeled with the  working_group tag: show 0.1234 amazement
Images labeled with the  old_friends tag: show 0.127 joy


In [8]:
#verify that each ANP label has the same emotion labels and scores across all images
anp_df[(anp_df.emotion_label=='joy') & (anp_df.anp_label=='young_couple')][:3]

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
1,951727030670259635_143763900,young_couple,0.019,0.2113,joy
1016,956837953127354719_6734387,young_couple,0.019,0.2113,joy
1679,961014334304796262_143854846,young_couple,0.019,0.2113,joy
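The check above looks at a single label and emotion by eye. A more systematic version, sketched below, counts the distinct scores per (anp_label, emotion_label) pair and reports every pair with more than one. We group on both columns because the same label can show up with several emotion labels. The toy frame is built from rows shown in the outputs above, and `inconsistent_pairs` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame built from rows shown in the outputs above
toy_anp = pd.DataFrame({
    'image_id': ['951727030670259635_143763900',
                 '956837953127354719_6734387',
                 '961014334304796262_143854846',
                 '951727030670259635_143763900'],
    'anp_label': ['young_couple', 'young_couple', 'young_couple', 'dirty_laundry'],
    'emotion_label': ['joy', 'joy', 'joy', 'joy'],
    'emotion_score': [0.2113, 0.2113, 0.2113, 0.0929],
})

def inconsistent_pairs(df):
    """Return (anp_label, emotion_label) pairs with more than one distinct score."""
    n_scores = df.groupby(['anp_label', 'emotion_label'])['emotion_score'].nunique()
    return n_scores[n_scores > 1]

# An empty result means every pair has exactly one score
print(inconsistent_pairs(toy_anp))
```

On the full data, `inconsistent_pairs(anp_df)` would run the same check over all labels.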


In [9]:
#show some of the rows with the lowest emotion score (every 50th row)
anp_df[anp_df.emotion_score == anp_df.emotion_score.min()][::50]
#apparently no one cares much about flat lakes, snowy plovers or gold bridges

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
9612,581742959299903837_21697543,flat_lake,-0.036,0.0417,interest
44932,844465605783354211_187539125,snowy_plover,0.013,0.0417,vigilance
103220,472461365156868908_25469443,snowy_plover,0.013,0.0417,anger
125497,640658853911311267_55281515,flat_lake,-0.036,0.0417,boredom
158622,888102345472019372_11520833,snowy_plover,0.013,0.0417,loathing
195058,1347305921630342109_619868570,snowy_plover,0.013,0.0417,sadness
206081,1396432770201223841_3807589911,snowy_plover,0.013,0.0417,distraction
261877,472461365156868908_25469443,snowy_plover,0.013,0.0417,amazement
278568,1016105401709793550_6734387,gold_bridge,-0.004,0.0417,fear
313835,1212399656613744758_6734387,gold_bridge,-0.004,0.0417,acceptance


In [10]:
#Which ANP label has the highest emotion score?
anp_df[anp_df.emotion_score == anp_df.emotion_score.max()]

Unnamed: 0,image_id,anp_label,anp_sentiment,emotion_score,emotion_label
42212,821358342425467127_31736205,junior_team,-0.104,0.7347,joy
150859,821358342425467127_31736205,junior_team,-0.104,0.7347,joy


## Investigate face_df

In [11]:
face_df.head()

Unnamed: 0,image_id,face_id,face_gender,face_gender_confidence,face_age_range_high,face_age_range_low,face_sunglasses,face_beard,face_beard_confidence,face_mustache,face_mustache_confidence,face_smile,face_smile_confidence,eyeglasses,eyeglasses_confidence,face_emo,emo_confidence
0,1003944279371027183_703978203,6.0,Female,98.741425,38.0,20.0,False,False,99.998474,False,99.999794,False,99.916168,True,95.395546,SAD,12.660271
1,1003944279371027183_703978203,6.0,Female,98.741425,38.0,20.0,False,False,99.998474,False,99.999794,False,99.916168,True,95.395546,CALM,8.252973
2,1003944279371027183_703978203,6.0,Female,98.741425,38.0,20.0,False,False,99.998474,False,99.999794,False,99.916168,True,95.395546,SURPRISED,24.634266
3,1003944279371027183_703978203,68.0,Male,99.927521,77.0,57.0,False,False,99.981598,False,99.993256,False,84.395294,True,99.420914,HAPPY,53.603287
4,1003944279371027183_703978203,68.0,Male,99.927521,77.0,57.0,False,False,99.981598,False,99.993256,False,84.395294,True,99.420914,SAD,5.50909


In [19]:
face_df.face_emo.unique()

array([u'SAD', u'CALM', u'SURPRISED', u'HAPPY', u'ANGRY', u'CONFUSED',
       u'DISGUSTED'], dtype=object)
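The head of face_df suggests that a face comes with several rows, one per candidate emotion, each with its own emo_confidence. If we want one emotion per face, we can keep the row with the highest confidence. The sketch below does this on the five rows shown above, assuming that (image_id, face_id) together identify a face. That is our assumption, not something the data description states.

```python
import pandas as pd

# The five rows from face_df.head(), reduced to the relevant columns
toy_face = pd.DataFrame({
    'image_id': ['1003944279371027183_703978203'] * 5,
    'face_id': [6.0, 6.0, 6.0, 68.0, 68.0],
    'face_emo': ['SAD', 'CALM', 'SURPRISED', 'HAPPY', 'SAD'],
    'emo_confidence': [12.660271, 8.252973, 24.634266, 53.603287, 5.50909],
})

# Row index of the highest-confidence emotion within each (image, face) group
top_rows = toy_face.groupby(['image_id', 'face_id'])['emo_confidence'].idxmax()

# Keep only those rows: one dominant emotion per face
dominant = toy_face.loc[top_rows, ['image_id', 'face_id', 'face_emo', 'emo_confidence']]
print(dominant)
```

Keeping only the top emotion throws away the other confidences. Pivoting the emotions into columns would be an alternative if you want to keep them as features.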

In [15]:
face_df[face_df.face_id==6].image_id.unique()

array([u'1003944279371027183_703978203', u'1033087147816285039_2062266819',
       u'1037683670639420072_265063047', u'1047215122351788754_703978203',
       u'1093873696185441499_46329534', u'1119217011730655101_481709584',
       u'1104740009251237357_452851338', u'1138955119085200031_265063047',
       u'1150050327748798797_289794729', u'1198728468760038729_242886474',
       u'1250334363540159661_1619510', u'1253288680550029702_53918317',
       u'1255153391978620792_288335200', u'1241646866162536985_249861555',
       u'1273955377204980622_3417740025', u'1367603945475797050_325893678',
       u'1330826916785979304_235671446', u'1349652396028007118_3041716852',
       u'1351769909839102619_34069800', u'1401382493533586371_244047076',
       u'1433016470283815424_372088523', u'1408061263501814861_287562303',
       u'1463497448172783755_235671446', u'1526697378736036722_235671446',
       u'1516940729463761881_372088523', u'1523680577910342346_288335200',
       u'150282984515818670

In [17]:
face_df[face_df.face_id==68].image_id.unique()

array([u'1003944279371027183_703978203', u'1004641390973853146_1600397470',
       u'1005437901930328421_25469443', u'1019701313080178099_545497348',
       u'1022680828042207385_13745951', u'1023119179100895389_50853245',
       u'1023195246083088281_1600397470', u'1026884393467324289_288335200',
       u'1031409866627259791_1600397470',
       u'1030698064156207776_1804133497',
       u'1028195508052991412_1600397470',
       u'1031449440933409165_2062266819', u'1037765269748111439_31736205',
       u'1041690397236181862_50853245', u'1043615270849979288_265063047',
       u'1039877445660566185_55520631', u'1050312984811934717_703978203',
       u'1053215604232659161_265063047', u'1051909114666393795_265063047',
       u'1057508306903973371_22180590', u'1050841776251883710_276232195',
       u'1062696778891723913_34069800', u'1071063239428690938_2032642067',
       u'1071047100158719995_30837828', u'1073471183808795100_246095675',
       u'1088582058421790813_262136545', u'10777036614

In [18]:
face_df[face_df.face_id==43].image_id.unique()

array([u'1000126179441391393_30837828', u'1002379244483879429_265063047',
       u'1005208437899006127_703978203', u'1008919842189085215_1508580385',
       u'1015167181504297657_50853245', u'1020039098149572675_372088523',
       u'1018138219065426633_1901242351', u'1016266728932061350_52590715',
       u'1025493698880638926_48972978', u'1027752914162419810_1508580385',
       u'1034297114384012322_190011156', u'1045631250754824202_289794729',
       u'1046405211291845649_531942752', u'1047339448896648362_143763900',
       u'1051846697123189318_703978203', u'1050822613961848595_545497348',
       u'1056799439775505885_1600397470', u'1065386827092286186_183823541',
       u'1062609832634847755_416455611', u'1071828136522898141_183823541',
       u'1082133558126944457_265063047', u'1073468458542993820_246095675',
       u'1081851010918562642_50853245', u'1088042086580069910_481709584',
       u'1083318303388582244_21697543', u'1095115317902971136_703978203',
       u'109293628809722937

## Investigate image_df

In [25]:
image_df.head()

Unnamed: 0,image_id,image_link,image_url,image_height,image_width,image_filter,image_posted_time_unix,image_posted_time,data_memorability,user_id,user_full_name,user_name,user_website,user_profile_pic,user_bio,user_followed_by,user_follows,user_posted_photos
0,1316962883971761394_3468175004,https://www.instagram.com/p/BJGysPxgsTy/,https://scontent.cdninstagram.com/t51.2885-15/...,640.0,640.0,Lo-fi,1471214231,14-08-2016 22:37:11,0.800521,3468175004,Leah Jenkins,leah.chelle,,https://scontent.cdninstagram.com/t51.2885-19/...,,7.0,0.0,1.0
1,552382455733335946_263042348,https://www.instagram.com/p/eqdOq2JLeK/,https://scontent.cdninstagram.com/t51.2885-15/...,612.0,612.0,Normal,1380069141,25-09-2013 00:32:21,0.875568,263042348,Taylor Degruise,taylordegruise,,https://scontent.cdninstagram.com/t51.2885-19/...,,316.0,347.0,73.0
2,594552614686078174_263042348,https://www.instagram.com/p/hARnP2pLTe/,https://scontent.cdninstagram.com/t51.2885-15/...,640.0,640.0,Vesper,1385096216,22-11-2013 04:56:56,0.672679,263042348,Taylor Degruise,taylordegruise,,https://scontent.cdninstagram.com/t51.2885-19/...,,316.0,347.0,73.0
3,553884883234370621_263042348,https://www.instagram.com/p/evy13fpLQ9/,https://scontent.cdninstagram.com/t51.2885-15/...,640.0,640.0,Amaro,1380248245,27-09-2013 02:17:25,0.843525,263042348,Taylor Degruise,taylordegruise,,https://scontent.cdninstagram.com/t51.2885-19/...,,316.0,347.0,73.0
4,725551583154452417_263042348,https://www.instagram.com/p/oRrVIcJLfB/,https://scontent.cdninstagram.com/t51.2885-15/...,640.0,640.0,Amaro,1400712510,21-05-2014 22:48:30,0.859796,263042348,Taylor Degruise,taylordegruise,,https://scontent.cdninstagram.com/t51.2885-19/...,,316.0,347.0,73.0
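The image_posted_time_unix column holds the posting time as seconds since 1970-01-01 (UTC). Converting it to a datetime makes it easy to derive features such as the year or hour of posting. The sketch below uses two rows from the output above and assumes the column is stored as integer seconds. `posted_at` and `posted_year` are column names we made up.

```python
import pandas as pd

# Two rows from image_df.head(), reduced to the relevant columns
toy_images = pd.DataFrame({
    'image_id': ['1316962883971761394_3468175004', '552382455733335946_263042348'],
    'image_posted_time_unix': [1471214231, 1380069141],
})

# unit='s' tells pandas that the numbers are seconds since the Unix epoch
toy_images['posted_at'] = pd.to_datetime(toy_images['image_posted_time_unix'], unit='s')
toy_images['posted_year'] = toy_images['posted_at'].dt.year

print(toy_images[['image_id', 'posted_at', 'posted_year']])
```

The first value converts to 2016-08-14 22:37:11, which matches the image_posted_time column in the output above.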


## Investigate object_labels_df

In [26]:
object_labels_df.head()

Unnamed: 0,image_id,data_amz_label,data_amz_label_confidence
0,863479386465416946_545497348,Animal,90.163101
1,916939688871507178_545497348,Animal,83.518669
2,551681403589539797_545497348,Animal,74.837212
3,1189285646274180856_545497348,Animal,76.920967
4,962361211517974133_545497348,Animal,71.223869


## Investigate metrics_df

In [27]:
metrics_df.head()

Unnamed: 0,image_id,comment_count,comment_count_time_created,like_count,like_count_time_created
0,1337283311810249709_3041716852,0.0,19-06-2017 19:33:26,15.0,19-06-2017 19:23:26
1,1337283311810249709_3041716852,0.0,19-06-2017 19:23:26,15.0,19-06-2017 19:23:26
2,1337834353379743556_3041716852,0.0,19-06-2017 19:33:26,23.0,19-06-2017 19:23:26
3,1337834353379743556_3041716852,0.0,19-06-2017 19:23:26,23.0,19-06-2017 19:23:26
4,1516356155708878303_3041716852,5.0,19-06-2017 19:23:07,19.0,19-06-2017 19:23:07
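In the rows above, the same image appears more than once with different creation times, so metrics_df has to be reduced to one row per image before it can be joined with the other frames on image_id. The sketch below keeps the largest count per image, which is only one possible choice, and then does a left merge. The second frame is a hypothetical stand-in for image_df, and whether its ids occur in both frames is an assumption made for illustration.

```python
import pandas as pd

# The five rows from metrics_df.head(), without the time columns
toy_metrics = pd.DataFrame({
    'image_id': ['1337283311810249709_3041716852', '1337283311810249709_3041716852',
                 '1337834353379743556_3041716852', '1337834353379743556_3041716852',
                 '1516356155708878303_3041716852'],
    'comment_count': [0.0, 0.0, 0.0, 0.0, 5.0],
    'like_count': [15.0, 15.0, 23.0, 23.0, 19.0],
})

# One row per image: keep the largest count seen (an assumption, not the only option)
per_image = toy_metrics.groupby('image_id', as_index=False)[
    ['comment_count', 'like_count']].max()

# Hypothetical stand-in for image_df: one image with metrics, one without
toy_images = pd.DataFrame({
    'image_id': ['1337283311810249709_3041716852', '1316962883971761394_3468175004'],
})

# A left merge keeps every image; images without metrics get NaN
merged = toy_images.merge(per_image, on='image_id', how='left')
print(merged)
```

The same pattern (reduce to one row per image_id, then merge) applies to the other frames that have several rows per image.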


## Investigate our Y variables (Survey dataframe)

In [28]:
survey_df.head()

Unnamed: 0,id,gender,born,education,employed,income,A_2,N_1,P_1,E_1,...,P,E,R,M,A,PERMA,N_EMO,P_EMO,imagecount,private_account
0,920bf027f7d13dbdc7b66b3d3324903c,Male,1975,College graduate,Employed for wages,"$30,000 to $39,999",4,5,5,3,...,5,3.0,6.0,6.0,4.0,5,5.0,5,465.0,public
1,b433b2bfe49e28d0b7c45925b53084e0,Male,1978,College graduate,Employed for wages,"$20,000 to $29,999",8,0,9,7,...,9,7.0,9.0,8.0,7.0,9,0.0,9,6.0,public
2,4becd8768d42ffa6ef0a17d827f230a2,Male,1980,High school graduate,Self-employed,"$40,000 to $49,999",7,7,6,9,...,6,9.0,6.0,6.0,8.0,6,7.0,6,,private
3,01d90eeb34866d03c52925738da7865f,Male,1959,College graduate,Employed for wages,"$10,000 to $19,999",6,4,1,5,...,1,5.0,3.0,3.0,3.0,1,4.0,1,,private
4,f4f54676f75f47c17dc434cf68845328,Female,1990,High school graduate,Employed for wages,"$80,000 to $89,999",7,3,8,7,...,8,7.0,8.0,8.0,7.0,8,3.0,8,767.0,public


In [29]:
print(survey_df['start_q'].max())
print(survey_df['start_q'].min())
print(survey_df['end_q'].max())
print(survey_df['end_q'].min())

2017-03-23 15:11:19
2016-12-05 14:01:21
2017-03-23 15:16:17
2016-12-05 14:02:52
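With start_q and end_q available, the time each respondent spent on the questionnaire can be derived as well. The sketch below reuses the four timestamps printed above, but the pairing into two rows is hypothetical: the printed values are column-wise extremes and need not belong to the same respondents. `duration_s` is a column name we made up.

```python
import pandas as pd

# Hypothetical pairing of the timestamps printed above
toy_survey = pd.DataFrame({
    'start_q': ['2016-12-05 14:01:21', '2017-03-23 15:11:19'],
    'end_q': ['2016-12-05 14:02:52', '2017-03-23 15:16:17'],
})

# Parse the strings; this is harmless if the columns are already datetimes
for col in ['start_q', 'end_q']:
    toy_survey[col] = pd.to_datetime(toy_survey[col])

# Duration of the questionnaire in seconds
toy_survey['duration_s'] = (toy_survey['end_q'] - toy_survey['start_q']).dt.total_seconds()
print(toy_survey)
```

Very short durations can be a sign that a respondent clicked through the survey, so this column may be useful when cleaning the target variables.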
