## Example: Pandas Fundamentals

---


### Dataset Overview

The Titanic dataset used here contains records for 891 passengers of the Titanic, which sank in the Atlantic Ocean in 1912. Each record includes the passenger's class, name, sex, age, fare, and survival status, among other columns.

In [2]:
import pandas as pd 
df = pd.read_csv("https://raw.githubusercontent.com/osias1997/python_pandas_data_analyse/master/data/titanic.csv")

# The head() method displays, by default, the first 5 rows of the DataFrame named “df”. 
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# The tail() method displays, by default, the last 5 rows of the DataFrame named “df”. 

df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [4]:
# The integer 8 inside the parentheses means that the last 8 rows of the DataFrame named “df” are displayed. 

df.tail(8) 


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


### Getting to know the dataset

* Find out the number of rows and the number of columns in this Pandas DataFrame.

In [5]:
# The “shape” attribute outputs a tuple that displays the number of rows, 
# followed by the number of columns. 
# Since it is an attribute, not a function or a method, there are no parentheses here.  

df.shape

(891, 12)

* Get a quick summary of the Pandas DataFrame

In [6]:
# Provides a summary of the df DataFrame, including the data types of its columns, 
# the number of non-null values, and the memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
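
The non-null counts reported by info() can also be turned into explicit missing-value counts. The following is a minimal sketch, assuming a small hand-made DataFrame (the name `sample` and its values are hypothetical stand-ins, not the real Titanic data):

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Titanic data
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# mean() of the same Boolean mask gives the share of missing values per column
missing_share = sample.isna().mean()
print(missing_share)
```

Applied to df, the same call should report 177 missing values for Age (891 entries minus 714 non-null), in line with the info() output above.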


### Generate descriptive statistics

The dataframe_name.describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. It works on numeric and object Series as well as on DataFrames with mixed column types. For a DataFrame with mixed types, the default output covers only the numeric columns, so the result depends on the data types of the columns that are passed in.

In [7]:
df.describe()


# For numeric columns, the following descriptive statistics are generated:
# * count: The number of non-empty values.
# * mean: The average (mean) value.
# * std: The standard deviation.
# * min: The minimum value.
# * 25%: The 25th percentile (the value below which 25% of the values fall).
# * 50%: The 50th percentile (the median or middle value).
# * 75%: The 75th percentile (the value below which 75% of the values fall).
# * max: The maximum value.

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [8]:
# Get the labels of the object (string) columns in DataFrame df
object_columns = df.select_dtypes(include=['object']).columns

df[object_columns].describe()


# For object columns, the following descriptive statistics are generated:
# * count: The number of non-empty values.
# * unique: The number of unique values.
# * top: The most frequent value.
# * freq: The frequency of the most frequent value.

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


In [9]:
# Get the labels of the int64 columns in DataFrame df
int_columns = df.select_dtypes(include=['int64']).columns

df[int_columns].describe()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch
count,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.523008,0.381594
std,257.353842,0.486592,0.836071,1.102743,0.806057
min,1.0,0.0,1.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,0.0
50%,446.0,0.0,3.0,0.0,0.0
75%,668.5,1.0,3.0,1.0,0.0
max,891.0,1.0,3.0,8.0,6.0
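
The cells above select columns by data type first and then call describe(). As an alternative, describe() accepts `include` and `exclude` parameters that do the selection in one step. This is a minimal sketch on a hypothetical miniature frame (`sample` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature frame with one text column and two numeric columns
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female"],
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = sample.describe(include="number")

# Everything except numeric columns: count, unique, top, freq
text_summary = sample.describe(exclude="number")

# All columns at once; statistics that do not apply to a column are NaN
full_summary = sample.describe(include="all")

print(numeric_summary)
print(text_summary)
print(full_summary)
```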


### Select rows and/or columns using Pandas
There are several ways to select rows and/or columns in pandas. Here are some of the most common methods:
* Using square brackets: Square brackets select one or more columns by label, or filter rows with a Boolean mask.
* Using the loc and iloc indexers: loc selects rows and/or columns by label, while iloc selects them by integer position.

In [83]:
# df["Name"] returns a Pandas Series with the column values from the "Name" column

df["Name"]


0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [84]:
df.Name

# This is wrong: df.["Name"]
# Instead use either: df["Name"] or df.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [85]:
# df[["Name"]] returns a Pandas DataFrame with a single column named "Name"

df[["Name"]]


Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"
890,"Dooley, Mr. Patrick"


In [12]:
# Select all rows and a single column using .loc[] (a list of labels returns a DataFrame)

df.loc[:, ["Name"]]

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"
890,"Dooley, Mr. Patrick"


In [87]:
# Select all rows and a single column using .loc[] (a single label returns a Series)

df.loc[:, "Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
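
The cells above differ only in whether the column is given as a single label or as a list of labels. The sketch below checks the resulting types on a hypothetical two-row frame (`sample` is an assumption for illustration; the names are taken from the output above):

```python
import pandas as pd

# Hypothetical two-row frame for illustration
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
})

as_series = sample["Name"]           # single label -> Series
as_frame = sample[["Name"]]          # list of labels -> DataFrame
loc_series = sample.loc[:, "Name"]   # single label with .loc -> Series
loc_frame = sample.loc[:, ["Name"]]  # list of labels with .loc -> DataFrame

# A Series has a one-dimensional shape; a DataFrame has (rows, columns)
print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```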

In [88]:
# Select specific rows and all columns using .loc[] 

df.loc[123:126]
# Unlike Python list slicing, a .loc[] slice includes the end label, so row 126 is returned.


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
123,124,1,2,"Webber, Miss. Susan",female,32.5,0,0,27267,13.0,E101,S
124,125,0,1,"White, Mr. Percival Wayland",male,54.0,0,1,35281,77.2875,D26,S
125,126,1,3,"Nicola-Yarred, Master. Elias",male,12.0,1,0,2651,11.2417,,C
126,127,0,3,"McMahon, Mr. Martin",male,,0,0,370372,7.75,,Q


In [89]:
# Get Embarked column for rows 123 to 126 inclusive in DataFrame df

df.loc[123:126,['Embarked']]

Unnamed: 0,Embarked
123,S
124,S
125,C
126,Q


In [90]:
# Get Embarked and Pclass columns for rows 123 to 126 in DataFrame df

df.loc[123:126,['Embarked','Pclass']]

Unnamed: 0,Embarked,Pclass
123,S,2
124,S,1
125,C,3
126,Q,3


In [91]:
# Get columns Pclass to SibSp for rows 123 to 126 in DataFrame df

df.loc[123:126,'Pclass':'SibSp']

Unnamed: 0,Pclass,Name,Sex,Age,SibSp
123,2,"Webber, Miss. Susan",female,32.5,0
124,1,"White, Mr. Percival Wayland",male,54.0,0
125,3,"Nicola-Yarred, Master. Elias",male,12.0,1
126,3,"McMahon, Mr. Martin",male,,0


In [92]:
# Select the first 9 rows and the columns at positions 3 and 5 (Name and Age) using .iloc[] 

df.iloc[:9,[3,5]]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
5,"Moran, Mr. James",
6,"McCarthy, Mr. Timothy J",54.0
7,"Palsson, Master. Gosta Leonard",2.0
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0


In [93]:
# Comparing loc and iloc
# The main difference between loc and iloc is that loc selects rows and columns by label, 
# while iloc selects rows and columns by integer position.

df.loc[:4, "Name":"Ticket"]
# Here loc selects the rows labeled 0 through 4 (the first 5 rows) and the columns Name to Ticket, inclusive. 
# The row labels are integers from the default RangeIndex; the column labels are strings.

Unnamed: 0,Name,Sex,Age,SibSp,Parch,Ticket
0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599
2,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803
4,"Allen, Mr. William Henry",male,35.0,0,0,373450


In [94]:
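# Comparing loc and iloc
# iloc selects by integer position and excludes the end of each slice: 
# rows 0 to 4 and columns 3 to 8 (Name to Ticket).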

df.iloc[:5,3:9]

Unnamed: 0,Name,Sex,Age,SibSp,Parch,Ticket
0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599
2,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803
4,"Allen, Mr. William Henry",male,35.0,0,0,373450
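
In df the row labels happen to equal the row positions, which can hide the difference between loc and iloc. This is a minimal sketch on a hypothetical frame whose labels differ from its positions (`sample`, its values, and its index are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame; the index labels deliberately differ from the positions 0..3
sample = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=[10, 20, 30, 40],
)

# loc slices by label and includes both end labels: rows 10, 20 and 30
by_label = sample.loc[10:30, "Name":"Age"]

# iloc slices by position and excludes the end position: positions 0, 1 and 2
by_position = sample.iloc[0:3, 0:2]

print(by_label)
print(by_position)
```

Both selections return the same three rows, but the slice bounds mean different things: 10:30 refers to index labels, while 0:3 refers to positions.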


### Filtering using multiple conditions

* The & symbol is the element-wise logical AND operator in Pandas conditional filtering. For each row, it returns True if both conditions are True, and False otherwise.

* The | symbol is the element-wise logical OR operator in Pandas conditional filtering. For each row, it returns True if either condition is True, and False only if both conditions are False.

* Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as ==.

In [95]:
# Filtering
# Boolean indexing using Boolean mask

# Create DataFrame of female survivors in df
female_survived = df[(df['Sex'] == 'female') & (df['Survived'] == 1)]

female_survived.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [96]:
# Filter passengers who were female and did not survive
female_and_not_survived = df[(df['Sex'] == 'female') & (df['Survived'] == 0)]

female_and_not_survived.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0,,S
24,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.075,,S
38,39,0,3,"Vander Planke, Miss. Augusta Maria",female,18.0,2,0,345764,18.0,,S
40,41,0,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.475,,S
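
The two cells above combine conditions with &. The sketch below shows the | operator and the ~ (element-wise NOT) operator on a hypothetical miniature frame (`sample` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature frame for illustration
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Survived": [0, 1, 1, 0],
})

# OR: keep rows where the passenger is female or travelled first class
female_or_first = sample[(sample["Sex"] == "female") | (sample["Pclass"] == 1)]

# NOT: ~ inverts a Boolean mask, keeping rows where Survived is not 1
not_survived = sample[~(sample["Survived"] == 1)]

print(female_or_first)
print(not_survived)
```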


In [16]:
# Maximum age for each sex
maxages = df.groupby('Sex')['Age'].max()
maxages

Sex
female    63.0
male      80.0
Name: Age, dtype: float64

In [17]:
# Minimum age for each sex
minages = df.groupby('Sex')['Age'].min()
minages

Sex
female    0.75
male      0.42
Name: Age, dtype: float64

### How to Use Groupby and Aggregate Functions in Pandas

Groupby is a powerful pandas operation that groups rows based on the values in one or more columns. This is useful for tasks such as calculating summary statistics for each group, identifying trends, and detecting outliers.

Once the data is grouped, an aggregate function reduces each group to a single summary value, such as its mean or its maximum.

In [97]:
# What was the average age of the passengers in each Passenger Class (Pclass) on the Titanic?

df.groupby('Pclass').Age.mean()

# Aggregate functions:
# mean(): calculates the average value of a group of data.
# sum(): adds up all the values in a group of data.
# min(): finds the smallest value in a group of data.
# max(): finds the largest value in a group of data.
# count(): counts the number of non-null values in a group of data.
# median(): finds the middle value in a group of data.
# std(): calculates the standard deviation of a group of data. 
# var(): calculates the variance of a group of data.


Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64
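
Several of the aggregate functions listed above can be computed in one pass with agg(). This is a minimal sketch on a hypothetical miniature frame (`sample` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature frame; one Age value is missing on purpose
sample = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, 35.0, 22.0, 26.0, None],
})

# agg() takes a list of function names and returns one column per function
age_stats = sample.groupby("Pclass")["Age"].agg(["count", "mean", "min", "max"])

print(age_stats)
```

The missing Age value is skipped, so count for class 3 is 2 rather than 3.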

In [98]:
# What are the ages of the oldest female and oldest male passengers aboard the Titanic?

df.groupby("Sex").Age.max()


Sex
female    63.0
male      80.0
Name: Age, dtype: float64

In [99]:
# What is the average fare paid by surviving passengers grouped by sex?

df[df['Survived'] == 1].groupby('Sex').Fare.mean()

Sex
female    51.938573
male      40.821484
Name: Fare, dtype: float64
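
Groupby also accepts a list of columns, which produces one group per combination of values. The sketch below computes a survival rate per sex and class on a hypothetical miniature frame (`sample` and its values are assumptions, not Titanic results); it relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Hypothetical miniature frame for illustration
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 1, 0, 0, 1, 1],
})

# One group per (Sex, Pclass) pair; the mean of Survived is the survival rate
survival_rate = sample.groupby(["Sex", "Pclass"])["Survived"].mean()

print(survival_rate)
```

The result is a Series with a two-level index, so a single group can be read with a tuple such as survival_rate.loc[("male", 3)].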