## CASE STUDY: TITANIC DATA ANALYSIS

- Aggregating Data
    - summary statistics
    - counting
    - grouped summary statistics

## 3. Aggregating Data

#### Summarizing numerical data
- .mean()
- .median()
- .min()
- .max()
- .var()
- .std()
- .sum()
- .quantile()

In [46]:
import pandas as pd
import numpy as np

In [47]:
#to remove all the jupyter warnings
import warnings
warnings.filterwarnings('ignore')

In [48]:
titanic_df = pd.read_csv('titanic.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [49]:
titanic_df['Age'].mean()

29.69911764705882

In [50]:
titanic_df['Age'].median()

28.0

In [51]:
titanic_df['Age'].min()

0.42

In [52]:
titanic_df['Age'].max()

80.0

In [53]:
titanic_df['Age'].var()

211.0191247463081

In [54]:
titanic_df['Age'].std()

14.526497332334044

In [55]:
titanic_df['Age'].sum()

21205.17

In [56]:
titanic_df['Age'].quantile()

28.0
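
The call above returns the same value as `.median()` because `quantile()` uses `q=0.5` when no argument is given. The sketch below illustrates this on a small made-up Series (the values are hypothetical, not taken from `titanic.csv`) and shows that a list of `q` values returns several quantiles at once.

```python
import pandas as pd

# Hypothetical ages, not taken from titanic.csv
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# With no argument, quantile() uses q=0.5, i.e. the median
default_q = ages.quantile()

# A list of q values returns one quantile per entry, indexed by q
quartiles = ages.quantile([0.25, 0.5, 0.75])

print(default_q)   # 30.0
print(quartiles)   # 0.25 -> 20.0, 0.50 -> 30.0, 0.75 -> 40.0
```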

#### .agg() method

- Apply one or more operations to a single column or to multiple columns
- Defining a function declares its parameters, e.g. `def pct30(column): return column.quantile(0.3)`
- Calling a function passes arguments, e.g. `titanic_df['Age'].agg(pct30)`
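
As a rough sketch of "one or more operations on single or multiple columns", the example below passes a list of functions to `.agg()`. The small DataFrame is a hypothetical stand-in for two numeric columns of `titanic_df`, and `pct30` is the illustrative function from the bullet above.

```python
import pandas as pd

# Hypothetical stand-in for two numeric columns of titanic_df
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

def pct30(column):
    """Return the 30th percentile of a column."""
    return column.quantile(0.3)

# One function on one column returns a single number
single = df["Age"].agg(pct30)

# Several functions on several columns return a DataFrame:
# one row per function, one column per input column
table = df[["Age", "Fare"]].agg(["mean", "median", pct30])

print(single)
print(table)
```

Built-in statistics can be named as strings, while a custom function is passed as the function object itself.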

In [57]:
# Aggregation on a single column
def percentile(column):
    return column.quantile(0.3)  # takes a column as its parameter and returns its 30th percentile

In [58]:
titanic_df['Age'].agg(percentile) #applying agg() on a column using simple function

22.0

#### Characteristics of a lambda function

- Written as a single expression on one line
- Anonymous (has no name)
- Typically defined right where it is used, not beforehand
- Typically not reused afterwards

In [59]:
# Aggregation on a single column
# A lambda function finds the 30th percentile in one line of code
titanic_df['Age'].agg(lambda column: column.quantile(0.3))

22.0

In [60]:
# Aggregation on multiple columns
titanic_df[['Age', 'Fare']].agg(lambda col: col.quantile(0.5))  # finds the 50th percentile (median) of both columns

Age     28.0000
Fare    14.4542
dtype: float64

In [61]:
titanic_df[['PassengerId', 'Age', 'Fare']].agg(lambda col: col.quantile(0.5))

PassengerId    446.0000
Age             28.0000
Fare            14.4542
dtype: float64

## 4. Cumulative statistics
- .cumsum()
- .cummax()
- .cummin()
- .cumprod()
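
Before running these on real data, it may help to see how the four methods treat missing values. The sketch below uses a small hypothetical Series with one `NaN`; by default the missing position stays `NaN` in the result and is skipped in the running calculation.

```python
import pandas as pd
import numpy as np

# Hypothetical data with one missing value
values = pd.Series([3.0, 1.0, np.nan, 4.0])

running_sum = values.cumsum()    # 3.0, 4.0, NaN, 8.0
running_max = values.cummax()    # 3.0, 3.0, NaN, 4.0
running_min = values.cummin()    # 3.0, 1.0, NaN, 1.0
running_prod = values.cumprod()  # 3.0, 3.0, NaN, 12.0

print(pd.DataFrame({
    "value": values,
    "cumsum": running_sum,
    "cummax": running_max,
    "cummin": running_min,
    "cumprod": running_prod,
}))
```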

In [62]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [63]:
pd.DataFrame(titanic_df['Age'].cumsum()).head() # The cumsum method calculates the cumulative sum of the elements in a Series or along a DataFrame axis.

Unnamed: 0,Age
0,22.0
1,60.0
2,86.0
3,121.0
4,156.0


In [64]:
pd.DataFrame(titanic_df['Age'].cummax()).head()
#The cummax method calculates the cumulative maximum of the elements in a Series or along a DataFrame axis.

Unnamed: 0,Age
0,22.0
1,38.0
2,38.0
3,38.0
4,38.0


In [65]:
pd.DataFrame(titanic_df['Age'].cummin()).head()
# The cummin method calculates the cumulative minimum of the elements in a Series or along a DataFrame axis.

Unnamed: 0,Age
0,22.0
1,22.0
2,22.0
3,22.0
4,22.0


In [66]:
pd.DataFrame(titanic_df['Age'].cumprod()).head()
# The cumprod method calculates the cumulative product of the elements in a Series or along a DataFrame axis.

Unnamed: 0,Age
0,22.0
1,836.0
2,21736.0
3,760760.0
4,26626600.0


## 5. Counting

- So far in this chapter, you've learned how to ``summarize numeric variables``. In the cells below, you'll learn how to ``summarize categorical data`` using counting.

- Categorical variables represent types of **data which may be divided into groups**. Examples of categorical variables are race, sex, age group, and educational level.

#### Drop Duplicates by Single & Multiple Columns
- drop_duplicates(subset= "")

In [67]:
titanic_df.drop_duplicates(subset="Pclass")
# Keeps the first row for each distinct value, which shows how many categories a variable/column contains
# e.g. Pclass contains 3 categories (3, 1, 2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [68]:
titanic_df.drop_duplicates(subset="Sex") # inplace is False by default; setting inplace=True would modify the original DataFrame
# Likewise, Sex contains 2 categories (male/female)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [69]:
# The cell above returns a new DataFrame, so it does not change the values in the original dataframe
# Deep copy: has its own data, so changing the copy does not change the original
# Shallow copy: shares data with the original, so a change can show up in both
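
A quick way to check whether `drop_duplicates` touches the original is to compare row counts before and after. This is a minimal sketch on a hypothetical one-column DataFrame, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 5 rows, 3 distinct classes
df = pd.DataFrame({"Pclass": [3, 1, 3, 2, 1]})

# Default (inplace=False): a new DataFrame is returned, df is left alone
deduped = df.drop_duplicates(subset="Pclass")
print(len(df), len(deduped))  # 5 3

# inplace=True modifies the object it is called on and returns None.
# Working on df.copy() (a deep copy by default) keeps df itself intact.
working = df.copy()
result = working.drop_duplicates(subset="Pclass", inplace=True)
print(result, len(working), len(df))  # None 3 5
```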

In [70]:
#drop duplicates on multiple variables
titanic_df.drop_duplicates(subset=["Pclass", "SibSp"])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
38,39,0,3,"Vander Planke, Miss. Augusta Maria",female,18.0,2,0,345764,18.0,,S


#### drop duplicates on multiple variables
Dropping duplicates on multiple variables is important. Suppose two people share a first name but have different last names, such as Umair Nawaz and Umair Khan. Dropping duplicates on the single column 'first name' compares first names only, so it keeps one row for 'Umair'. Dropping duplicates on both 'first name' and 'last name' keeps both people, because each combination is distinct. The dataframe above works the same way: each remaining row is a distinct combination of Pclass and SibSp.
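
The name example can be sketched directly in code. The DataFrame and its column names (`first_name`, `last_name`) are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical people: same first name, two distinct full names, one exact repeat
people = pd.DataFrame({
    "first_name": ["Umair", "Umair", "Umair"],
    "last_name": ["Nawaz", "Khan", "Khan"],
})

# Comparing on first name only keeps a single row
by_first = people.drop_duplicates(subset="first_name")

# Comparing on both columns keeps one row per distinct full name
by_both = people.drop_duplicates(subset=["first_name", "last_name"])

print(len(by_first))  # 1
print(len(by_both))   # 2
```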

### Getting Count Stats using `.value_counts()`
- The `value_counts` method counts how often each unique value occurs in a Series, or each unique combination of values in the selected columns of a DataFrame. This is particularly useful for understanding the distribution of values in a dataset.
    - sort=False
    - normalize=True
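
The two options listed above can be tried on a small made-up Series first. This is only a sketch, and the port codes below are hypothetical values rather than the real `Embarked` column.

```python
import pandas as pd

# Hypothetical categorical values
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Default: counts sorted from most to least frequent
counts = embarked.value_counts()

# sort=False: same counts, but not ordered by frequency
unsorted_counts = embarked.value_counts(sort=False)

# normalize=True: proportions of the total instead of raw counts
proportions = embarked.value_counts(normalize=True)

print(counts)       # S 3, C 2, Q 1
print(proportions)  # S 0.5, C 0.333..., Q 0.166...
```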

In [71]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [72]:
pd.DataFrame(titanic_df.value_counts('Age'))

Age,count
24.00,30
22.00,27
18.00,26
30.00,25
28.00,25
...,...
20.50,1
14.50,1
12.00,1
0.92,1


In [73]:
pd.DataFrame(titanic_df.value_counts('Survived'))

Survived,count
0,549
1,342


In [74]:
pd.DataFrame(titanic_df.value_counts(['Sex', 'Survived'], sort=False))

Sex,Survived,count
female,0,81
female,1,233
male,0,468
male,1,109


In [75]:
pd.DataFrame(titanic_df.value_counts(['Sex', 'SibSp', 'Survived']))

Sex,SibSp,Survived,count
male,0,0,361
female,0,1,137
female,1,1,80
male,0,1,73
male,1,0,71
female,0,0,37
male,1,1,32
female,1,0,26
male,2,0,12
male,4,0,11


In [76]:
pd.DataFrame(titanic_df.value_counts('Age', normalize=True))
# normalize=True turns the counts into proportions of the total, e.g. 0.042017 means about 4.2% of the recorded ages

Age,proportion
24.00,0.042017
22.00,0.037815
18.00,0.036415
30.00,0.035014
28.00,0.035014
...,...
20.50,0.001401
14.50,0.001401
12.00,0.001401
0.92,0.001401


## 6. Group summary statistics

- Average age of Males & Females Using subsetting
- Average age of Males & Females Using `.groupby()`
- Apply different statistics such as mean, count, min & max to each group.

In [77]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [78]:
titanic_df[titanic_df['Sex'] == 'male']['Age'].mean()

30.72664459161148

In [79]:
titanic_df[titanic_df['Sex'] == 'female']['Age'].mean()

27.915708812260537

In [80]:
titanic_df.groupby('Sex')['Age'].mean()
#groupby: first finds the categories of the given variable/column, then performs the requested operation on each category

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
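
The subsetting cells and the `groupby` cell above compute the same numbers. A minimal sketch on hypothetical data makes the equivalence explicit: the loop repeats the manual filter once per category, which is what `groupby` does in a single call.

```python
import pandas as pd

# Hypothetical stand-in for the Sex and Age columns
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 35.0, 26.0],
})

# Manual approach: one filter per category
by_hand = {
    sex: df[df["Sex"] == sex]["Age"].mean()
    for sex in df["Sex"].unique()
}

# groupby approach: split by category, then take the mean of each group
grouped = df.groupby("Sex")["Age"].mean()

print(by_hand)   # male: 28.5, female: 32.0
print(grouped)
```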

In [81]:
titanic_df.groupby(['Sex', 'PassengerId', 'Name', 'Pclass'])[['Survived', 'SibSp', 'Fare']].mean()

Sex,PassengerId,Name,Pclass,Survived,SibSp,Fare
female,2,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",1,1.0,1.0,71.2833
female,3,"Heikkinen, Miss. Laina",3,1.0,0.0,7.9250
female,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,1.0,1.0,53.1000
female,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",3,1.0,0.0,11.1333
female,10,"Nasser, Mrs. Nicholas (Adele Achem)",2,1.0,1.0,30.0708
...,...,...,...,...,...,...
male,884,"Banfield, Mr. Frederick James",2,0.0,0.0,10.5000
male,885,"Sutehall, Mr. Henry Jr",3,0.0,0.0,7.0500
male,887,"Montvila, Rev. Juozas",2,0.0,0.0,13.0000
male,890,"Behr, Mr. Karl Howell",1,1.0,0.0,30.0000


In [82]:
pd.DataFrame(titanic_df.groupby(['Survived', 'Sex'])['Age'].count())  # <-- multiple grouping columns

Survived,Sex,Age
0,female,64
0,male,360
1,female,197
1,male,93


In [83]:
titanic_df.groupby('Sex')['Age'].agg(['count', 'min', 'max']) # <-- multiple stats

Sex,count,min,max
female,261,0.75,63.0
male,453,0.42,80.0


In [84]:
titanic_df.groupby(['Survived', 'Sex'])[['Age', 'SibSp']].agg(['count', 'min', 'max'])

Survived,Sex,Age count,Age min,Age max,SibSp count,SibSp min,SibSp max
0,female,64,2.0,57.0,81,0,8
0,male,360,1.0,74.0,468,0,8
1,female,197,0.75,63.0,233,0,4
1,male,93,0.42,80.0,109,0,4


## 7. Pivot tables

In [85]:
pd.DataFrame(titanic_df.groupby('Sex')['Age'].mean())

Sex,Age
female,27.915709
male,30.726645


- The ``"values"`` argument is the column that you want to ``summarize``, and the ``"index"`` argument is the column that you want to ``group by``.
- By default, pivot_table takes the **mean** value for each group.
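
To make the link with `groupby` concrete, here is a small sketch on hypothetical data. It checks that the default pivot table matches a grouped mean, and shows another statistic being requested by name through `aggfunc`.

```python
import pandas as pd

# Hypothetical stand-in for the Sex and Age columns
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 30.0, 70.0, 10.0, 20.0, 60.0],
})

# Default pivot table: mean of `values` for each category of `index`
pivot = df.pivot_table(values="Age", index="Sex")

# The same numbers from groupby
grouped = df.groupby("Sex")[["Age"]].mean()

# A different statistic, passed by name
median_pivot = df.pivot_table(values="Age", index="Sex", aggfunc="median")

print(pivot)         # female 30.0, male 40.0
print(median_pivot)  # female 20.0, male 30.0
```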

In [86]:
#pivot with the default aggfunc, which is the mean
titanic_df.pivot_table(values='Age', index='Sex')

Sex,Age
female,27.915709
male,30.726645


In [87]:
#explicitly define the statistic, i.e. aggfunc=np.mean
titanic_df.pivot_table(values='Age', index='Sex', aggfunc=np.mean)

Sex,Age
female,27.915709
male,30.726645


In [88]:
#multiple statistics
titanic_df.pivot_table(values='Age', index='Sex', aggfunc=[np.max, np.std])

Sex,max (Age),std (Age)
female,63.0,14.110146
male,80.0,14.678201


### pivot on two variables
- To group by two variables, we can pass a second variable name into the `columns` argument.
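
One way to read the `columns` argument is as a `groupby` on both variables followed by `.unstack()`, which moves the second grouping level into the column headers. The sketch below checks this on a small hypothetical DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the Sex, Survived and Age columns
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 0, 1],
    "Age": [30.0, 20.0, 40.0, 25.0, 50.0, 35.0],
})

# Rows come from `index`, column headers from `columns`
pivot = df.pivot_table(values="Age", index="Sex", columns="Survived", aggfunc="mean")

# Same table built with groupby: group on both, then spread Survived across columns
unstacked = df.groupby(["Sex", "Survived"])["Age"].mean().unstack()

print(pivot)
# female: 40.0 (Survived=0), 30.0 (Survived=1)
# male:   40.0 (Survived=0), 20.0 (Survived=1)
```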

In [89]:
#equivalent groupby version:
#titanic_df.groupby(['Sex', 'Survived'])['Age'].mean().unstack()

#pivot on two variables
titanic_df.pivot_table(values='Age', columns='Survived', index='Sex', aggfunc=np.mean)

Sex,Survived=0,Survived=1
female,25.046875,28.847716
male,31.618056,27.276022


#### filling missing values in pivot table

In [90]:
titanic_df.pivot_table(values='Age', columns='Survived', index='Sex', fill_value=0)
#fill_value replaces NaN values in the resulting table; this table has none, so the output is unchanged

Sex,Survived=0,Survived=1
female,25.046875,28.847716
male,31.618056,27.276022
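
The Titanic table has a value for every Sex/Survived combination, so `fill_value` has nothing to replace there. The hypothetical data below leaves one combination empty to show the effect.

```python
import pandas as pd

# Hypothetical data with no row for (female, Survived=0)
df = pd.DataFrame({
    "Sex": ["male", "male", "female"],
    "Survived": [0, 1, 1],
    "Age": [30.0, 20.0, 25.0],
})

# Without fill_value, the empty combination shows up as NaN
with_nan = df.pivot_table(values="Age", index="Sex", columns="Survived")

# With fill_value=0, that cell is filled with 0 instead
filled = df.pivot_table(values="Age", index="Sex", columns="Survived", fill_value=0)

print(with_nan)
print(filled)
```

A filled 0 is a placeholder rather than an observed mean age, so it is worth keeping in mind when reading the table.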


#### summing with pivot table
Setting `margins=True` lets us see a summary statistic at several levels of the dataset at once: the entire dataset, grouped by one variable, grouped by the other variable, and grouped by both variables.

In [93]:
titanic_df.pivot_table(values='Age', index='Sex', columns='Survived', fill_value=0, margins=True)
# margins=True: adds a final row and column, 'All', holding the mean for each row, each column, and the whole dataset

Sex,Survived=0,Survived=1,All
female,25.046875,28.847716,27.915709
male,31.618056,27.276022,30.726645
All,30.626179,28.34369,29.699118
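
The 'All' entries are computed from the underlying rows, not by averaging the cells already in the table. The sketch below uses hypothetical data with unequal group sizes, where the two approaches give different answers.

```python
import pandas as pd

# Hypothetical data with unequal group sizes
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 0, 1],
    "Age": [30.0, 50.0, 10.0, 20.0, 40.0],
})

table = df.pivot_table(values="Age", index="Sex", columns="Survived", margins=True)
print(table)

# The male cells are 40.0 and 10.0, whose plain average would be 25.0,
# but the 'All' column holds the mean of all three male rows
male_margin = table.loc["male", "All"]   # 30.0

# The bottom-right corner is the mean of every row in df
overall = table.loc["All", "All"]        # 30.0
```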


In [94]:
# To rename the added 'All' row and column, pass margins_name
titanic_df.pivot_table(values='Age', index='Sex', columns='Survived', fill_value=0, margins=True, margins_name='Mean')

Sex,Survived=0,Survived=1,Mean
female,25.046875,28.847716,27.915709
male,31.618056,27.276022,30.726645
Mean,30.626179,28.34369,29.699118


## Thank You