Titanic - Data Analysis
Introduction
As taken from Kaggle, 'The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.'
The data set used for this analysis is a subset of the complete data set and contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.
For more information on the data set, visit Kaggle.
Goal
The goal of this analysis is to investigate the data set and identify patterns in who survived.
Import Packages
    # Import and load necessary packages
    import re
    import pandas as pd
    import seaborn as sns
    import numpy as np
    from matplotlib import pyplot as pyplt
    from matplotlib import figure as fig
    from matplotlib.gridspec import GridSpec

    # This is to plot the graphs within the same Jupyter cell
    %matplotlib inline

    # Adjusting the plot area size to accommodate bigger/wider graphs
    pyplt.rcParams['figure.figsize'] = 12,6
Load data
    # Load dataset
    # This loads the csv file to a pandas data frame
    # Note that the path given below is where the csv file exists locally within Python's current working directory
    # Repoint the path as necessary to make sure the function read_csv is able to find the csv
    titanic = pd.read_csv("Udacity/P2/Titanic/titanic-data.csv")
Data Exploration and Processing
    # Exploring the dataset for available columns and type of data
    titanic.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
While the fields Embarked, Survived and Pclass hold codes for the port of embarkation, survival indicator and travel class, descriptive names are more meaningful, especially in visualizations.
Note that the port names were taken from the Kaggle website.
    ports = ["Cherbourg", "Queenstown", "Southampton"]
    survival = ["Dead", "Survived"]
    travelclass = ["First Class", "Second Class", "Third Class"]

    # Creating descriptive labels
    # Survival label
    titanic['Survival'] = titanic.Survived.map({0 : survival[0], 1 : survival[1]})

    # Ports label
    titanic['Ports'] = titanic.Embarked.map({"C" : ports[0], "Q" : ports[1], "S" : ports[2]})

    # Travel class label
    titanic['TravelClass'] = titanic.Pclass.map({1 : travelclass[0], 2 : travelclass[1], 3 : travelclass[2]})

    # Making sure the new label fields have been added
    titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survival | Ports | TravelClass |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Dead | Southampton | Third Class |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Survived | Cherbourg | First Class |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Survived | Southampton | Third Class |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Survived | Southampton | First Class |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Dead | Southampton | Third Class |
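The label mapping above relies on `Series.map` with a dictionary. As a minimal sketch on a hypothetical four-value sample (not the real data), the snippet below shows that any value without a matching key, including a missing port, comes back as NaN rather than raising an error, so passengers without an embarkation code simply get no port label.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Embarked codes, including one missing value
embarked = pd.Series(["S", "C", "Q", np.nan])

# Same code-to-name dictionary as used for the Ports label
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# map() looks up each value in the dictionary; unmatched values become NaN
port_labels = embarked.map(port_names)
print(port_labels)
print("Rows without a port label:", port_labels.isnull().sum())
```

Rows with a NaN label are silently dropped by later `groupby` calls on that column, which is worth keeping in mind when comparing totals.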
    # Taking a look at the basic stats
    titanic.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
How many people were on the ship and how many survived?
The data set contains 891 passengers, of whom only 342 survived, giving an overall survival rate of about 38%.
    # Get some basic counts and stats
    totalpassengers = len(set(titanic["PassengerId"]))
    survived = sum(titanic["Survived"])
    overallsurvivalrate = format((survived/float(totalpassengers)) * 100, '0.2f')

    print "Total Passengers on Titanic", totalpassengers
    print "Passengers Survived", survived
    print "Overall Survival Rate", overallsurvivalrate,"%"

    Total Passengers on Titanic 891
    Passengers Survived 342
    Overall Survival Rate 38.38 %
How have travel class and gender affected the survival rate?
The labels on the bar plot below show each group's survivors as a percentage of all 891 passengers, not the survival rate within the group. By that measure, surviving women in 1st class form the largest group at about 10%, followed by surviving women in 2nd and 3rd class at around 8% each. Surviving men in 1st and 3rd class each account for around 5%, and surviving men in 2nd class form the smallest group.
    # Builds a bar plot with gender and travel class on the x-axis and the number of people survived on the y-axis
    # Each bar is labelled with its survivors as a percentage of all passengers
    genderplt = sns.barplot(x = "Sex", y = "Survived", hue = "TravelClass", hue_order = travelclass,
                            data = titanic, estimator = np.sum, ci = None)
    genderplt.set(ylabel = "People Survived", xlabel = "Gender")
    for p in genderplt.patches:
        height = p.get_height()
        genderplt.text(p.get_x()+p.get_width()/2., height + 3,
                       '{:1.2f}'.format((height * 100)/totalpassengers), ha="center")
    genderplt.legend(title = "Travel Class")
    genderplt.set(title = "Survival Rate across Gender and Travel Class")
The plot below shows the same breakdown in more detail, with deaths and survivors side by side. While the deaths in the first and second class were contained, third class seems to be the worst affected for both men and women.
    # Builds a countplot split on travel class
    factorplt = sns.factorplot(x="Sex", hue="Survival", col="Pclass", data=titanic, kind="count")
    factorplt.set(ylabel = "Number of People", xlabel = "Gender")

    titles = travelclass

    for ax, title in zip(factorplt.axes.flat, titles):
        ax.set_title(title)

        # Label each bar with its count as a percentage of all passengers
        for p in ax.patches:
            height = p.get_height()
            ax.text(p.get_x()+p.get_width()/2., height + 3,
                    '{:1.2f}'.format((height * 100)/totalpassengers), ha="center")
The travel class has had a huge impact on the survival of passengers on the Titanic, with people travelling by 1st and 2nd class having a higher probability of survival and within those classes, women having a better chance of surviving over men.
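The bar labels in the plots above are shares of all passengers. To read off the survival rate within each gender and travel class directly, one option is to take the mean of the 0/1 `Survived` indicator per group. The sketch below uses a small made-up data frame (the `toy` rows are illustrative and are not taken from the data set); the same `groupby` call could presumably be applied to `titanic`.

```python
import pandas as pd

# Made-up rows standing in for the Titanic data frame (illustrative only)
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 1, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 indicator is the fraction of the group that survived
rate_by_group = (
    toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass") * 100
)
print(rate_by_group.round(1))
```

Each cell is then a percentage of that group only, so the cells no longer depend on how many passengers the group had.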
How has the age of passengers affected the survival rate?
The Age field has many missing values and some fractional values. For this particular analysis, missing ages have been filled with 0 and fractional ages have been converted to whole numbers. Note that the integer conversion truncates rather than rounds, so ages below 1 also become 0.
    titanic['Age'].unique()

    array([ 22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. , 4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. , 8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. , 49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. , 16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. , 71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 , 51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. , 45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. , 60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. , 70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])

    titanic['Age'].isnull().sum()

    177
    # A function to clean up the age field
    def cleanup(df, field, convtoint):
        # Replace any missing values (NaN) with 0
        df[field] = df[field].fillna(value = 0)

        # Check if it needs to be converted to int
        if convtoint == 1:
            # astype(int) truncates, so fractional ages below 1 become 0
            df[field] = df[field].astype(int)

        # Round the values (has no further effect once the field is an int)
        df[field] = df[field].round(0)

        return field + " has been cleaned up"

    cleanup(titanic, "Age", 1)

    'Age has been cleaned up'
    average_age_titanic = titanic['Age'].mean()
    std_age_titanic = titanic['Age'].std()
    print average_age_titanic
    print std_age_titanic

    23.7833894501
    17.5973438421
In the initial phase of the analysis, filling the missing ages with 0 skewed the distribution graphs to the left. Since 0 is not a valid age for those passengers and only marks a missing value, age 0 has been excluded from the analysis.
Although age 0 has been excluded for the purposes of this analysis, the ages of those passengers could also be predicted or standardized to improve the accuracy of the charts.
    average_age_titanic = titanic[titanic.Age > 0]['Age'].mean()
    std_age_titanic = titanic[titanic.Age > 0]['Age'].std()
    print average_age_titanic
    print std_age_titanic

    29.973125884
    14.3032955152
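The drop in the average age caused by the zero fill can be reproduced on a tiny example. The sketch below uses a hypothetical five-value sample (not the real data) to compare the zero-filled mean with the mean pandas computes when the missing values are left as NaN, which it skips by default. Leaving the values as NaN would be an alternative to filling with 0 and filtering on `Age > 0`, and it would not discard infants whose age truncates to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# Filling with 0 pulls the mean down: (22 + 38 + 0 + 35 + 0) / 5
mean_zero_filled = ages.fillna(0).mean()

# Leaving NaN in place: pandas skips it, so only the 3 known ages count
mean_skipping_nan = ages.mean()

print("Zero-filled mean:", mean_zero_filled)
print("NaN-skipping mean:", round(mean_skipping_nan, 2))
print("Known ages:", ages.count())
```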
    # Builds an age distribution plot
    agedist = sns.distplot(titanic[titanic.Age > 0]['Age'])
    agedist.set_xlim(0, max(titanic[titanic.Age > 0]['Age'])+10)
    agedist.set_title("Age distribution")
    agedist.set_xlabel("Age")
    agedist.set_ylabel("Density of population")

    # Builds a plot on age and survival; ageplt[1:] skips the first row, which is age 0
    ageplt = titanic.groupby(['Survival', 'Age']).count().unstack('Survival')["PassengerId"]
    ageplt[1:].plot(kind='bar', stacked=True, title = "Number of passengers over Age")
    pyplt.ylabel("Number of passengers")
    pyplt.xlabel("Age")
From the above visualization, it can be noted that most passengers were in the 16-40 age group and that the lowest survival rate was in the 18-30 age group.
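Reading age-group survival rates off a stacked bar chart is approximate. A possible way to put numbers on such a statement is to bin the ages with `pd.cut` and compute the count and survival rate per band. The sketch below runs on made-up rows, and the band edges in `bins` are an assumption chosen for illustration, not the groups used in the original analysis.

```python
import pandas as pd

# Made-up ages and outcomes (illustrative only)
toy = pd.DataFrame({
    "Age":      [4, 17, 22, 25, 29, 35, 52, 61],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Assumed band edges; each interval is open on the left and closed on the right
bins = [0, 18, 30, 40, 80]
labels = ["0-18", "19-30", "31-40", "41-80"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Passenger count and survival rate (mean of the 0/1 indicator) per age band
band_stats = toy.groupby("AgeBand", observed=False)["Survived"].agg(["count", "mean"])
print(band_stats)
```

Reporting the count next to the rate makes it easier to see whether a low rate comes from a large group or from only a handful of passengers.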
What effect did the fare paid have on survival rate?
As the y-axes of the visualizations below show, fares in 1st class ranged up to a little more than 500, although it looks like only 2 passengers paid more than 500. What is interesting, though, is that a large majority of passengers in 1st class paid less than 100, and on average those who paid more were more likely to survive.
In second and third class, the majority of passengers paid fares below 30 and 20 respectively.
On average, older passengers paid lower fares in every class, and the chance of survival also decreases as the age of the passenger increases.
Please note that the visualizations do not include passengers whose age is missing or 0.
    # Builds a linear relationship between age and fare over survival
    fareplt = sns.lmplot(x="Age", y="Fare", hue = 'Survival', col="Pclass", data=titanic[titanic.Age>0], sharey = False)
    fareplt.set(ylabel = "Fare", xlabel = "Age")
    fareplt.axes[0,0].set_ylim(0,)
    fareplt.axes[0,1].set_ylim(0,)
    fareplt.axes[0,2].set_ylim(0,)
    fareplt.axes[0,0].set_xlim(0,)
    fareplt.axes[0,1].set_xlim(0,)
    fareplt.axes[0,2].set_xlim(0,)

    titles = travelclass
    for ax, title in zip(fareplt.axes.flat, titles):
        ax.set_title(title)
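The fare observations above are read from the scatter plots. As a rough numerical cross-check, the median fare can be tabulated per travel class and survival outcome. The sketch below uses made-up fares (the `toy` values are illustrative, not figures from the data set); the median is used because a few very high fares would dominate a mean.

```python
import pandas as pd

# Made-up fares (illustrative only)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 0],
    "Fare":     [90.0, 70.0, 50.0, 30.0, 8.0, 10.0, 7.0, 9.0],
})

# Median fare per class (rows) and survival outcome (columns: 0 = dead, 1 = survived)
median_fare = toy.groupby(["Pclass", "Survived"])["Fare"].median().unstack("Survived")
print(median_fare)
```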
What role has the port of embarkation played?
While Southampton had the largest number of passengers, it also had the lowest survival rate. This is probably because Southampton also had the largest number of passengers in 3rd class. The passenger counts suggest that Southampton was the busiest of the three ports. Queenstown had the fewest passengers, especially in the 1st and 2nd travel classes.
    emb = titanic.groupby(['Survival', 'Ports','TravelClass']).count().unstack('Ports')['PassengerId']
    emb.head(6)
Survival | TravelClass | Cherbourg | Queenstown | Southampton |
---|---|---|---|---|
Dead | First Class | 26 | 1 | 53 |
Dead | Second Class | 8 | 1 | 88 |
Dead | Third Class | 41 | 45 | 286 |
Survived | First Class | 59 | 1 | 74 |
Survived | Second Class | 9 | 2 | 76 |
Survived | Third Class | 25 | 27 | 67 |
    # Builds a bar plot between embarkation port and passengers survived over travel class
    portplt = sns.barplot(x = "Ports", y = "Survived", hue = "TravelClass", hue_order = travelclass,
                          data = titanic, estimator = np.sum, ci = None)
    portplt.set(ylabel = "People Survived", xlabel = "Embarkation Area")
    portplt.set(title = "Number of people that survived across travel class")

    # Percentage of dead and survived passengers within each port
    emb = titanic.groupby(['Survival', 'Ports']).count().unstack('Ports')['PassengerId']
    emb = emb/emb.sum()*100
    emb.head()
Survival | Cherbourg | Queenstown | Southampton |
---|---|---|---|
Dead | 44.642857 | 61.038961 | 66.304348 |
Survived | 55.357143 | 38.961039 | 33.695652 |
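The per-port percentages above can also be produced in one step with `pd.crosstab` and `normalize="columns"`, which divides each column by its own total. This is offered as an alternative sketch, not the method used in the original notebook, and it runs on made-up rows rather than the real data.

```python
import pandas as pd

# Made-up passengers (illustrative only)
toy = pd.DataFrame({
    "Ports":    ["Cherbourg", "Cherbourg", "Queenstown",
                 "Southampton", "Southampton", "Southampton", "Southampton"],
    "Survival": ["Survived", "Dead", "Dead",
                 "Dead", "Dead", "Survived", "Dead"],
})

# Rows: outcome, columns: port; each column is scaled to sum to 100
pct = pd.crosstab(toy["Survival"], toy["Ports"], normalize="columns") * 100
print(pct)
```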
    # Builds a function that draws one pie plot per port showing the survival/death rate
    def plotpie(series):
        i = 2
        for c in list(series):
            pyplt.figure(i, figsize = (4,2))
            pyplt.pie(series[c], labels=survival, colors=['red', 'green'], autopct='%1.1f%%',
                      shadow=True, startangle=90)
            # Use an equal aspect ratio so that the pie is drawn as a circle
            pyplt.axis('equal')
            pyplt.title("Survival Rate at " + c)
            i += 1

    # To build pie plots at different embarkation ports
    plotpie(emb[["Cherbourg", "Queenstown", "Southampton"]])
Conclusion
The following features were examined in this analysis:
- Gender
- Travel Class
- Age
- Fare
- Port of Embarkation
The most evident findings in this analysis are that:
- passengers in the upper classes had a better chance of surviving than passengers in the lower class
- women had a better chance of surviving than men
- passengers aged between 18 and 30 were the most numerous and also had a very poor survival rate
- Southampton was the busiest port with the most passengers, while Queenstown had the fewest, especially in first and second class
Limitations
Some limitations of this analysis come from the size of the dataset and the missing values in Age and Cabin. The dataset covers fewer than half of the people on board the Titanic, and within it Age and Cabin have more than 170 and 650 missing values respectively.
Filling the missing values of age with 0 skews the visualizations to the left, so this analysis excludes the records where age is missing or 0.
Cabin could also be a strong indicator of survival, as some lifeboats could have been more accessible from some of the cabin areas while other areas might have had very little or late access.
The precision of this analysis could be improved with more information on these fields and with a larger dataset.
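One simple way to estimate the missing ages instead of excluding them is to fill each gap with the median age of the passenger's travel class. This is only one of many possible imputation strategies and is not something the analysis above does; the sketch below demonstrates it on made-up rows, and the `AgeImputed` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up passengers with two missing ages (illustrative only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median of the known ages within each class, aligned row by row
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Keep known ages; use the class median only where the age is missing
toy["AgeImputed"] = toy["Age"].fillna(group_median)
print(toy)
```

Imputed values narrow the spread of the age distribution, so any chart built on them would need to say so.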