## Data Wrangling Phase

### Read in the data

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
#%matplotlib inline
%pylab inline
import seaborn as sns

df = pd.read_csv('input/titanic-data.csv')
df.head()


Populating the interactive namespace from numpy and matplotlib


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [25]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


## Data Dictionary

Copied from the "data introduction" of the [Kaggle competition](https://www.kaggle.com/c/titanic/data).

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


## Variable Notes


pclass: A proxy for socio-economic status (SES)	
1st = Upper	
2nd = Middle	
3rd = Lower	

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...	
Sibling = brother, sister, stepbrother, stepsister	
Spouse = husband, wife (mistresses and fiancés were ignored)	

parch: The dataset defines family relations in this way...	
Parent = mother, father	
Child = daughter, son, stepdaughter, stepson	
Some children travelled only with a nanny, therefore parch=0 for them.	



## Data Wrangling Phase

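From the `info()` output above we can see that values are missing in "Cabin" (204 of 891 non-null), "Age" (714 of 891), and "Embarked" (889 of 891). "Cabin" is mostly empty, so we drop that column.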


In [27]:
new_df = df.drop(['Cabin'], axis = 1)
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB
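The non-null counts printed by `info()` can also be turned directly into missing-value counts. Below is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real CSV; only the column names come from the dataset, the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# Number of missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)

# Share of rows missing in each column, useful when deciding what to drop.
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to the real `df`, the same two lines should agree with the `info()` output: 891 - 204 = 687 missing values in "Cabin", 891 - 714 = 177 in "Age", and 891 - 889 = 2 in "Embarked".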


In [28]:
#quote: some ideas to get you started:
#Titanic Data
#What factors made people more likely to survive?


# Correlation function copied from the class video.

def correlation(x, y):
    '''
    Compute the correlation (Pearson's r) between the two input variables.
    Each input is either a NumPy array or a Pandas Series.
    correlation = average of (x in standard units) times (y in standard units)
    Note: "ddof=0" is passed to std() so the population standard deviation is used.
    '''
    x_std = (x - x.mean()) / x.std(ddof=0)
    y_std = (y - y.mean()) / y.std(ddof=0)
    return (x_std * y_std).mean()
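As a sanity check, the hand-written function can be compared with pandas' built-in `Series.corr`, which also computes Pearson's r by default. The sketch below uses invented numbers (`x`, `y`, and `y_nan` are not Titanic columns). It also illustrates a caveat: when one input contains NaN, `correlation` still standardizes the other input over all of its rows, whereas `Series.corr` drops incomplete pairs first, so the two results can differ slightly.

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Same formula as the notebook's function.
    x_std = (x - x.mean()) / x.std(ddof=0)
    y_std = (y - y.mean()) / y.std(ddof=0)
    return (x_std * y_std).mean()

# Invented data with no missing values: both approaches should agree.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])
manual = correlation(x, y)
builtin = x.corr(y)
print('manual:', manual, 'builtin:', builtin)

# Same data with one missing value in y.
y_nan = pd.Series([2.0, np.nan, 4.0, 3.0, 5.0])
naive = correlation(x, y_nan)          # x standardized over all 5 rows
builtin_nan = x.corr(y_nan)            # uses only the 4 complete pairs

# Dropping incomplete rows first brings the manual result back in line.
mask = y_nan.notnull()
manual_complete = correlation(x[mask], y_nan[mask])
print('naive:', naive, 'builtin:', builtin_nan, 'complete:', manual_complete)
```

This may be why a correlation involving `age`, which has missing values, can come out slightly different from what `df.corr()` would report.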


In [29]:

# Compute the correlation of each numerical column with Survived.
pid = df['PassengerId']
survived = df['Survived']
pclass = df['Pclass']
name = df['Name']
sex = df['Sex']
age = df['Age']
sibsp = df['SibSp']
parch = df['Parch']
ticket = df['Ticket']
fare = df['Fare']
cabin = df['Cabin']
embarked = df['Embarked']

print('correlation(survived, pid):', correlation(survived, pid) )
print('correlation(survived, pclass): ', correlation(survived, pclass) )
print('correlation(survived, age): ', correlation(survived, age) )
print('correlation(survived, sibsp):', correlation(survived, sibsp) )
print('correlation(survived, parch):', correlation(survived, parch) )
print('correlation(survived, fare):', correlation(survived, fare) )
#print('correlation(survived, sex):', correlation(survived, sex) )
#print('correlation(survived, name):', correlation(survived, name) )
#print('correlation(survived, embarked):', correlation(survived, embarked) )
# The three calls above raise an error: only numerical data can be correlated this way.



correlation(survived, pid): -0.005006660767066522
correlation(survived, pclass):  -0.33848103596101325
correlation(survived, age):  -0.077982678413863
correlation(survived, sibsp): -0.03532249888573573
correlation(survived, parch): 0.08162940708348272
correlation(survived, fare): 0.2573065223849616


### Pclass (-0.34), Fare (0.26), and Parch (0.08) have the strongest correlations with survival.
### So maybe I should investigate these more.
'Sex' is something I wondered about too, but we cannot correlate the strings 'male' and 'female', so we need to transform them into numbers first.


In [33]:
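def transform_sex(df_sex):
    # Encode sex as a number: 1 for 'male', 0 for 'female'.
    if df_sex == 'male':
        return 1
    elif df_sex == 'female':
        return 0


print(df.head())
# Not applied yet; a possible next step:
# new_df['Sex'] = df['Sex'].apply(transform_sex)
df.head()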


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
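The encoding described above can also be done without a helper function by using `Series.map`. Below is a minimal sketch on a made-up frame: the 1 = male, 0 = female coding follows `transform_sex`, while the column name `SexCode` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 1, 0, 0, 1],
})

# Map each string to a number; anything not in the dict would become NaN.
toy['SexCode'] = toy['Sex'].map({'male': 1, 'female': 0})

# With a numerical column, the correlation with Survived can be computed.
r = toy['SexCode'].corr(toy['Survived'])
print(toy)
print('correlation(survived, sex):', r)
```

Because the coding is arbitrary, only the magnitude of the result is meaningful; swapping 0 and 1 flips the sign.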


In [None]:
print('age fare corr:', correlation(age, fare))
# This could lead to another interesting question: do older people pay more?



In [None]:

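# Earlier attempts, kept for reference:
# m, b = np.polyfit(fare, fare, 1)
# plt.plot(fare, m * fare + b, '-')
#
# m, b = np.polyfit(age, fare, 1)
# plt.plot(age, m * age + b, '-')

# Caution: nan_to_num replaces every missing age with 0, which distorts the fit.
new_age = np.nan_to_num(age)
print(new_age.shape)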


# https://stackoverflow.com/questions/19068862/how-to-overplot-a-line-on-a-scatter-plot-in-python
m, b = np.polyfit(new_age, fare, 1)
plt.plot(new_age, fare, '.')
plt.plot(new_age, m * new_age + b, '-')
print('m = ', m)


In [None]:
# https://stackoverflow.com/questions/14016247/python-find-integer-index-of-rows-with-nan-in-pandas
# Keep only the rows where Age is known.
age_df = new_df[~np.isnan(new_df['Age'])]
new_age = age_df['Age']
new_fare = age_df['Fare']

# Fit age as a linear function of fare (fare on the x-axis, age on the y-axis).
m, b = np.polyfit(new_fare, new_age, 1)
plt.plot(new_fare, new_age, '.')
plt.plot(new_fare, m * new_fare + b, '-')

print(new_age.shape)
print('m = ', m)
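The two fits above treat missing ages differently: `np.nan_to_num` turns every missing age into 0, while the boolean mask drops those rows. The sketch below shows how much that choice can matter, using invented numbers (`age` and `fare` here are toy Series, not the real columns).

```python
import numpy as np
import pandas as pd

# Invented data: fare equals age wherever age is known,
# and the two passengers with unknown age paid high fares.
age = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 60.0])
fare = pd.Series([10.0, 20.0, 80.0, 40.0, 90.0, 60.0])

# Option 1: replace missing ages with 0 (what nan_to_num does).
age_zero = np.nan_to_num(age.values)
m_zero, b_zero = np.polyfit(age_zero, fare.values, 1)

# Option 2: drop the rows where age is missing.
known = age.notnull()
m_drop, b_drop = np.polyfit(age[known].values, fare[known].values, 1)

print('slope with NaN -> 0:   ', m_zero)
print('slope with NaN dropped:', m_drop)
```

In this toy case the zero-filled rows pull the fitted line so far that the slope changes sign, which is why dropping (or otherwise imputing) missing ages is usually the safer choice before fitting.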


In [None]:
I give up on age.

## Exploration Phase



In [None]:
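# First five rows of each group.
sex_survived = df[['Sex', 'Survived']].groupby(['Sex'], as_index=False)
print(sex_survived.head())
# Mean of Survived per group, i.e. the survival rate by sex.
sex_survived = df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean()
print(sex_survived)
# Use Sex for the x-axis labels instead of the default integer index.
sex_survived.plot.bar(x='Sex', y='Survived')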


In [None]:
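# First five rows of each group.
embark_survived = df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False)
print(embark_survived.head())

# Mean of Survived per group, i.e. the survival rate by port of embarkation.
embark_survived = df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()
print(embark_survived)
# Use Embarked for the x-axis labels instead of the default integer index.
embark_survived.plot.bar(x='Embarked', y='Survived')
#sns.barplot(x='Embarked', y='Survived', data=embark_survived, order=['S','C','Q'])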
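The same `groupby(...).mean()` pattern extends to two factors at once. Below is a sketch on a made-up frame, grouping by `Sex` and `Pclass` together; the values are invented, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass': [3, 1, 1, 3, 3, 3],
    'Survived': [0, 1, 1, 0, 0, 1],
})

# Survival rate for every (Sex, Pclass) combination.
rate = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()

# Pivot Pclass into columns so the result reads as a small table.
table = rate.unstack('Pclass')
print(table)
```

On the real data, a table like this would help separate the effect of sex from the effect of ticket class when asking which factors made people more likely to survive.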



In [None]:
fare.hist(bins = 20)


In [None]:
age.hist(bins = 20, range = (0,100) )

In [None]:

parch.plot(kind = 'hist', grid = True)


In [None]:
fare.hist()


## Conclusions Phase

