### Pandas Lab - Finding, Querying, Creating Data

### Section 1: Selecting Data

Complete the following prompts, and compare your results with the answers to confirm that you did the operation properly.

In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('../../data/titanic.csv')

**1).** Find the average age of all passengers on board.  The answer will be 29.7

In [3]:
df['Age'].mean()

29.69911764705882

**2).** Find the median value of the 'Fare' and 'SibSp' columns.  The answer will be 14.45 and 0.00, respectively.

In [4]:
df[['Fare','SibSp']].median()

Fare     14.4542
SibSp     0.0000
dtype: float64

**3).** Find the median value of the 'Fare' and 'SibSp' columns in the first 100 rows.  The answer will be 15.675 and 0.00.

In [5]:
df[['Fare','SibSp']][:100].median()

Fare     15.675
SibSp     0.000
dtype: float64

**4).** Using the .iloc command, grab the modal values of the last 4 columns in the dataset.  The result should be a dataframe with 3 rows and 4 columns that has the values 1601, 8.05, B96 B98, and S in the first row.

In [7]:
df.iloc[:,-4:].mode()

| | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|
| 0 | 1601 | 8.05 | B96 B98 | S |
| 1 | 347082 | NaN | C23 C25 C27 | NaN |
| 2 | CA. 2343 | NaN | G6 | NaN |
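The extra rows in this result come from how .mode() handles ties: it returns every value tied for most frequent in a column and pads the shorter columns with NaN. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical data: 'Ticket' has two tied modes, 'Embarked' has one.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "B2", "C3"],
    "Embarked": ["S", "S", "S", "C", "Q"],
})

# One row per tied mode; columns with fewer modes are padded with NaN.
modes = toy.mode()
print(modes)
```

Because of this padding, the first row of a .mode() result is usually the one to read when a single "most common" value per column is wanted.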


**5).** Using the .iloc command, grab the mean value of the first 250 rows of the first 3 columns in the dataset.  The answer should be:
 - PassengerId: 125.5
 - Survived: 0.344
 - Pclass: 2.416

In [9]:
df.iloc[:250,:3].mean()

PassengerId    125.500
Survived         0.344
Pclass           2.416
dtype: float64
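Note that .iloc slices by position and excludes the end point, so `:250` selects rows 0 through 249. The sketch below uses a hypothetical `toy` frame with the default integer index to contrast this with label-based .loc, which includes the end label:

```python
import pandas as pd

# Hypothetical frame with a default RangeIndex (labels 0..9).
toy = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "c": range(20, 30),
    "d": range(30, 40),
})

# Positional: first 5 rows (positions 0-4), first 3 columns.
by_position = toy.iloc[:5, :3]

# Label-based: the end label 4 is included, so this selects the same rows.
by_label = toy.loc[:4, ["a", "b", "c"]]

print(by_position.mean())
```

The two selections only coincide because the index labels match the row positions; after sorting or filtering, they can differ.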

### Section 2: Querying Data

**1).** How many females were on board the Titanic? How many males? The answers should be 314 and 577, respectively.

In [11]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

**2).** What was the survival rate for females on the Titanic? The answer should be 0.74

In [15]:
df[df['Sex'] == 'female']['Survived'].mean()

0.7420382165605095

**3).** What was the survival rate for males? The answer should be 0.19

In [16]:
df[df['Sex'] == 'male']['Survived'].mean()

0.18890814558058924
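Both rates can also be computed in one step by grouping on 'Sex'. A minimal sketch, using a small hypothetical `toy` frame in place of the Titanic data:

```python
import pandas as pd

# Hypothetical passengers; 'Survived' is 1 for survivors and 0 otherwise.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

The result is a Series indexed by the group label, so `rates["female"]` and `rates["male"]` return the individual rates.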

**4).** What was the survival rate for passengers in either Pclass 1 or Pclass 2?  The answer should be 0.56

In [20]:
df[(df['Pclass'] == 1)
  | (df['Pclass'] == 2)]['Survived'].mean()

0.5575
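When one column is compared against several values, .isin() is a compact alternative to chaining `|` conditions. A sketch on hypothetical data (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical passengers.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1, 2],
    "Survived": [1, 0, 0, 1, 1, 1],
})

# Two ways to build the same boolean mask.
mask_or = (toy["Pclass"] == 1) | (toy["Pclass"] == 2)
mask_isin = toy["Pclass"].isin([1, 2])

# .loc[mask, column] filters rows and picks the column in one step.
rate = toy.loc[mask_isin, "Survived"].mean()
print(rate)
```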

**5).** What was the survival rate if you were female and had at least 1 sibling or spouse on board (the 'SibSp' column counts both)? The answer should be 0.686.

In [107]:
df[(df['Sex'] == 'female')
  & (df['SibSp'] >= 1)]['Survived'].mean()

0.6857142857142857

### Section 3: Creating New Data

**1).** Create a column called 'Family_Size' that is the sum of 'SibSp' and 'Parch'. Then create a column called 'Is_Alone' that is True when 'Family_Size' is 0 and False when it is greater than 0.

In [38]:
df['Family_Size'] = df['SibSp'] + df['Parch']
df['Is_Alone'] = np.where(df['Family_Size'] > 0, False, True)

**2).** Create a column called 'Demographic' that breaks people up into the following categories:
 - Below 8 years old: 'Child'
 - 8 to under 22 years old: 'Adolescent'
 - 22 to 55 years old (inclusive): 'Adult'
 - Over 55: 'Senior'
 
When you're finished, use the method .value_counts() to confirm each category has the following count values:
 - Child: 50
 - Adolescent: 154
 - Adult: 470
 - Senior: 40

The remaining 177 passengers have no recorded age and fall into the default 'undefined' category.

In [67]:
conditions = [
    (df['Age'] < 8),
    (df['Age'] >= 8) & (df['Age'] < 22),
    (df['Age'] >= 22) & (df['Age'] <= 55),
    (df['Age'] > 55)
]

status = ['Child','Adolescent','Adult','Senior']

df['Demographic'] = np.select(conditions, status, 'undefined')
df['Demographic'].value_counts()

Adult         470
undefined     177
Adolescent    154
Child          50
Senior         40
Name: Demographic, dtype: int64
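As an alternative to np.select, pd.cut can express the same age bands in one call, and it leaves missing ages as NaN rather than labelling them 'undefined'. The sketch below uses made-up ages and assumes the boundaries from the conditions above (under 8, 8 to under 22, 22 through 55, over 55). Using np.nextafter to keep an age of exactly 55 in 'Adult' is a choice made for this sketch, not something from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including boundary values and one missing value.
ages = pd.Series([0.5, 7.9, 8, 21.5, 22, 55, 55.5, np.nan])

# With right=False each bin is [left, right). The third edge is the smallest
# float above 55, so that 55 itself still falls in the 'Adult' bin.
bins = [-np.inf, 8, 22, np.nextafter(55, np.inf), np.inf]
labels = ["Child", "Adolescent", "Adult", "Senior"]

demographic = pd.cut(ages, bins=bins, labels=labels, right=False)
print(demographic)
```

The result is a categorical Series, so .value_counts() reports the four labels and skips the missing entries by default.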

**3).** Create a column called 'Gender_Status' that returns the following values:

 - 'F-High' if passenger is female and passenger class is 1.
 - 'F-Low' if passenger is female and passenger class is 2 or 3
 - 'M-High' if passenger is male and passenger class is 1
 - 'M-Low' if passenger is male and passenger class is 2 or 3
 
When you are finished, use .value_counts() to confirm you have the following count values for each of the following:
 - 'M-Low': 455
 - 'F-Low': 220
 - 'M-High': 122
 - 'F-High': 94

In [68]:
conditions = [
    (df['Sex'] == 'female') & (df['Pclass'] == 1),
    (df['Sex'] == 'female') & ((df['Pclass'] == 2) | (df['Pclass'] == 3)),
    (df['Sex'] == 'male') & (df['Pclass'] == 1),
    (df['Sex'] == 'male') & ((df['Pclass'] == 2) | (df['Pclass'] == 3))
]

status = ['F-High','F-Low','M-High','M-Low']

df['Gender_Status'] = np.select(conditions, status, 'undefined')
df['Gender_Status'].value_counts()

M-Low     455
F-Low     220
M-High    122
F-High     94
Name: Gender_Status, dtype: int64
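Because the label is just a sex code joined to a class code, the same column can also be built by mapping each part separately and concatenating the strings. A sketch under that assumption, on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical passengers.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 1, 3, 2],
})

# Map each part of the label on its own.
sex_code = toy["Sex"].map({"female": "F", "male": "M"})
class_code = toy["Pclass"].eq(1).map({True: "High", False: "Low"})

# Series of strings concatenate element-wise with +.
toy["Gender_Status"] = sex_code + "-" + class_code
print(toy)
```

This avoids listing every combination by hand, at the cost of being less explicit than the np.select version about which cases exist.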

**4).** Using string methods, extract the *title* (greeting) from each passenger's name.  For example, if someone's name is 'Ms. Madame Bovary', create a column that contains the value 'Ms.' and nothing else.

**Hint:** Take a look at the split() method and see what it does if you're not sure where to go.

In [3]:
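# Take the text before the first '.', keep the part after the comma,
# strip the leading space, and put the '.' back.
df['title'] = df['Name'].str.split('.').str[0].str.split(',').str[1].str.lstrip() + '.'
df[['Name','title']].head(10)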

| | Name | title |
|---|---|---|
| 0 | Braund, Mr. Owen Harris | Mr. |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs. |
| 2 | Heikkinen, Miss. Laina | Miss. |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs. |
| 4 | Allen, Mr. William Henry | Mr. |
| 5 | Moran, Mr. James | Mr. |
| 6 | McCarthy, Mr. Timothy J | Mr. |
| 7 | Palsson, Master. Gosta Leonard | Master. |
| 8 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | Mrs. |
| 9 | Nasser, Mrs. Nicholas (Adele Achem) | Mrs. |
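A regular expression is another way to pull out the title. The sketch below assumes every name follows the 'Surname, Title. Given names' pattern seen in the output above; the pattern itself is an illustration and may need adjusting for names that break that format.

```python
import pandas as pd

# A few names copied from the output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# After the comma and any spaces, capture everything up to and including
# the first period. expand=False returns a Series for a single group.
titles = names.str.extract(r",\s*([^.]+\.)", expand=False)
print(titles)
```

Names that do not match the pattern come back as NaN, which makes malformed entries easy to spot.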


In [1]:
from sklearn.preprocessing import LabelEncoder

In [2]:
le = LabelEncoder()