# Week 2 - Data Analysis
#### Part 4
### Notebook created by Jonathan Penava and modified by Simon Hood
By the end of this lesson you should be able to read a data set from a CSV file into a DataFrame and manipulate that DataFrame.

## Overview
- NumPy
- Pandas
- Data Frames
- <span style="color:red;">Reading CSV</span>

## Reading CSV

In this part we want to read a set of data and begin to prepare it for analysis.  What information is useful to us?  What information can be removed?  How can we change the data to make it easier to analyze? <br><br>
We are going to read in the Titanic passenger list data set, downloaded from:
https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv

In [2]:
import numpy as np
import pandas as pd

In [6]:
titanicDf = pd.read_csv('Dataset - Titanic.csv')
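
As a minimal sketch of what `pd.read_csv` does, the example below parses three rows copied from the output shown later in this notebook. The text is held in an in-memory string so the example runs without the CSV file on disk; the names `csv_text` and `sample_df` are placeholders for illustration only.

```python
import io

import pandas as pd

# Header plus three rows copied from the Titanic data set.
csv_text = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
'''

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)                  # (3, 12)
print(sample_df["Cabin"].isna().sum())  # empty fields are read as NaN
```

Note that the quoted names contain commas but are still read as a single field, and that the empty Cabin fields come back as NaN.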

It is important to know what each column is and what its values represent.  
<ul>
<li>PassengerId - is an index value assigned to the data entry.</li>
<li>Survived - 1 if they survived, 0 otherwise.</li>
<li>Pclass - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd</li>
<li>Name - Name of the passenger.</li>
<li>Sex - male or female for each passenger.</li>
<li>Age - age of the passenger</li>
<li>SibSp - No. of siblings / spouses aboard the Titanic</li>
<li>Parch - No. of parents / children aboard the Titanic</li>
<li>Ticket - Ticket Number</li>
<li>Fare - Passenger Fare</li>
<li>Cabin - Cabin number</li>
<li>Embarked - Port where the passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton).</li>
</ul>

In [10]:
titanicDf.head()  #Displays the first 5 records as well as column names

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [11]:
titanicDf.info()  #Provides metadata information about columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
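
The non-null counts reported by `info()` can also be turned around into missing-value counts. The sketch below does this on a toy frame built from the five rows displayed by `head()` above; `sample_df` is a hypothetical name and only a few columns are kept.

```python
import numpy as np
import pandas as pd

# Toy frame: a few columns from the first five records.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_counts = sample_df.isna().sum()

# The mean of the same boolean mask is the fraction missing per column.
missing_share = sample_df.isna().mean()

print(missing_counts)
print(missing_share)
```

On the full data set, the `info()` output above implies 891 - 204 = 687 missing Cabin values (about 77%) and 891 - 714 = 177 missing Age values (about 20%).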


Let's analyze a little of what we can see.
<ul>
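<li>Cabin is null for most records (only 204 of 891 are non-null), so it is of little use for analysis.</li>
<li>Embarked records only the port of boarding, which does not tell us much about our passengers.</li>
<li>PassengerId is just a row identifier and does not give us useful data.</li>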
</ul><br>
Let's get rid of the data we don't need or want.  What other data do you think might not be useful when determining if someone survived?
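
One rough way to spot identifier-like columns such as PassengerId is to count the distinct values in each column. This is only a heuristic sketch on a toy frame taken from the first five records, not a rule from the lesson; `sample_df` and `id_like` are hypothetical names.

```python
import pandas as pd

# Toy frame: a few columns from the first five records.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Number of distinct values in each column.
unique_counts = sample_df.nunique()

# A column with one distinct value per row behaves like a row label.
id_like = unique_counts[unique_counts == len(sample_df)].index.tolist()

print(unique_counts)
print(id_like)  # ['PassengerId']
```

On a sample this small, a continuous column such as Fare could also look unique, so treat the result as a prompt to inspect the column rather than as a reason to drop it.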

In [12]:
# Drop the three unwanted columns in a single call
titanicDf.drop(columns=['Cabin', 'Embarked', 'PassengerId'], inplace=True)
titanicDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
dtypes: float64(2), int64(4), object(3)
memory usage: 62.8+ KB
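
Because an in-place drop changes `titanicDf` itself, running the same cell a second time raises a `KeyError`: the columns are already gone. A minimal sketch of a variant that can be re-run safely, on a toy frame with hypothetical names (`sample_df`, `unwanted`, `trimmed_df`):

```python
import pandas as pd

# Toy frame with the three unwanted columns and two we want to keep.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

unwanted = ["Cabin", "Embarked", "PassengerId"]

# Without inplace=True, drop returns a new frame and leaves sample_df intact.
trimmed_df = sample_df.drop(columns=unwanted)

# errors="ignore" skips labels that are already missing, so repeating is harmless.
trimmed_again = trimmed_df.drop(columns=unwanted, errors="ignore")

print(list(trimmed_df.columns))  # ['Survived', 'Pclass']
```

Keeping the original frame untouched also makes it easy to go back if a dropped column turns out to be useful.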


In [14]:
titanicDf.head(5)

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05


In [15]:
titanicDf[titanicDf['Age'].isnull()]   #Find all records where Age is NaN

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583
17,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000
19,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250
26,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250
28,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792
...,...,...,...,...,...,...,...,...,...
859,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292
863,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500
868,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000
878,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958
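
The filter above works in two steps: `isnull()` builds a boolean mask, and indexing the DataFrame with that mask keeps only the rows marked `True`. A small sketch with three records taken from the outputs above (`sample_df` and `age_missing` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Three records from the outputs above; Moran has no recorded age.
sample_df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Moran, Mr. James", "Heikkinen, Miss. Laina"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 8.4583, 7.925],
})

# Step 1: one True/False value per row.
age_missing = sample_df["Age"].isnull()
print(age_missing.tolist())  # [False, True, False]

# Step 2: keep only the rows where the mask is True.
missing_rows = sample_df[age_missing]
print(missing_rows)

# Summing the mask counts the affected rows.
print(int(age_missing.sum()))  # 1
```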


In [17]:
#Remove all records where Age is NaN
titanicDf.dropna(subset=['Age'], inplace=True)

In [18]:
titanicDf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Name      714 non-null    object 
 3   Sex       714 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     714 non-null    int64  
 6   Parch     714 non-null    int64  
 7   Ticket    714 non-null    object 
 8   Fare      714 non-null    float64
dtypes: float64(2), int64(4), object(3)
memory usage: 55.8+ KB
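
Notice that `info()` now reports `Index: 714 entries, 0 to 890` instead of a `RangeIndex`: the remaining rows keep their original labels, so the index has gaps. The sketch below shows this on a toy frame and, as an optional step that the lesson itself does not take, renumbers the rows with `reset_index`. The names `sample_df`, `cleaned_df` and `renumbered_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Three records from the outputs above; the middle one has no age.
sample_df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Moran, Mr. James", "Heikkinen, Miss. Laina"],
    "Age": [22.0, np.nan, 26.0],
})

# Remove the rows whose Age is NaN; the remaining rows keep their labels.
cleaned_df = sample_df.dropna(subset=["Age"])
print(cleaned_df.index.tolist())     # [0, 2]

# Optional: renumber from 0 and discard the old labels.
renumbered_df = cleaned_df.reset_index(drop=True)
print(renumbered_df.index.tolist())  # [0, 1]
```

Whether to renumber is a judgment call: keeping the original labels makes it easy to trace a row back to the source file.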


It might be easier to analyze our data if we use numerical values instead of strings.  We can change our Sex column so that male=0 and female=1.

In [19]:
# Encode only the Sex column; a DataFrame-wide replace would search every column
titanicDf['Sex'] = titanicDf['Sex'].map({'male': 0, 'female': 1})



In [20]:
titanicDf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Name      714 non-null    object 
 3   Sex       714 non-null    int64  
 4   Age       714 non-null    float64
 5   SibSp     714 non-null    int64  
 6   Parch     714 non-null    int64  
 7   Ticket    714 non-null    object 
 8   Fare      714 non-null    float64
dtypes: float64(2), int64(5), object(2)
memory usage: 55.8+ KB
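
The male=0, female=1 encoding can also be sketched on a single column with an explicit lookup table, followed by a check that no value was left unmapped. This is an illustrative variant using the Sex values of the first five records; `sex_codes`, `encoded` and `unmapped` are hypothetical names.

```python
import pandas as pd

# Sex values of the first five records.
sex = pd.Series(["male", "female", "female", "female", "male"], name="Sex")

# Explicit lookup table for the encoding.
sex_codes = {"male": 0, "female": 1}

# map() looks each value up in the table; values not in the table become NaN.
encoded = sex.map(sex_codes)

# Count the values that had no entry in the table.
unmapped = int(encoded.isna().sum())

print(encoded.tolist())  # [0, 1, 1, 1, 0]
print(unmapped)          # 0
```

A non-zero `unmapped` count would point to unexpected spellings (for example different capitalization) that need cleaning before encoding.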
