**Exploring Pandas functions on the Titanic dataset**

In [4]:
import pandas as pd
path = './titanic.csv'
df = pd.read_csv(path)

**1. head()**
- Returns the first 5 rows of the dataset by default.

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**2. tail()**
- Returns the last 5 rows of the dataset by default.

In [5]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
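Both `head()` and `tail()` accept a row count `n`. The sketch below uses a small made-up frame (`toy` is a stand-in for illustration, not the Titanic data) to show a custom `n` and a negative `n`.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are illustrative only)
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

first_three = toy.head(3)        # first 3 rows instead of the default 5
last_two = toy.tail(2)           # last 2 rows
all_but_last_two = toy.head(-2)  # negative n: every row except the last 2

print(first_three)
print(last_two)
print(len(all_but_last_two))
```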


**3. info()**
- Shows summary of dataframe including data types and non-null counts.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**4. describe()**
- Provides summary statistics for numerical columns.

In [7]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
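By default `describe()` skips text columns when numeric ones are present. As a minimal sketch on made-up rows (the `toy` frame is an assumption for illustration), `include="all"` adds `count`, `unique`, `top` and `freq` for the non-numeric columns:

```python
import pandas as pd

# Toy rows modelled on the Titanic columns (values are illustrative)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", None],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# include="all" summarises numeric and non-numeric columns together;
# statistics that do not apply to a column are shown as NaN
summary = toy.describe(include="all")
print(summary)
```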


**5. shape**
- Gives the number of rows and columns as a tuple. It is an attribute, so it takes no parentheses.

In [9]:
df.shape

(891, 12)

**6. columns**
- Lists all column names. It is an attribute rather than a method.

In [11]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

**7. dtypes**
- Shows the data type of each column. It is an attribute rather than a method.

In [12]:
df.dtypes

PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
Cabin,object
Embarked,object


**8. isnull()**
- Detects missing values and returns a boolean DataFrame of the same shape (`True` where a value is missing).

In [15]:
# df.isnull()
df.isnull().head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False


**9. sum()**
- Sums values column by column. Chained after `isnull()`, it counts the missing values in each column.

In [16]:
df.isnull().sum()

PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2
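Raw counts are easier to judge next to the share of rows they represent. Because `True` counts as 1, `isnull().mean()` gives that share directly. A small sketch on made-up data (`toy` and `report` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are illustrative)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

# Combine both into one report, worst column first
report = pd.DataFrame(
    {"missing": missing_count, "share": missing_share}
).sort_values("missing", ascending=False)
print(report)
```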


**10. dropna()**
- Removes rows that contain at least one missing value.

In [18]:
df.dropna().shape

(183, 12)
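A plain `dropna()` keeps only 183 of the 891 rows, largely because `Cabin` is mostly empty. The `subset`, `thresh` and `axis` parameters give finer control. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame where every row has at least one gap (values are illustrative)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
    "Fare": [7.25, 71.28, 7.93],
})

strict = toy.dropna()                   # drops any row with a missing value
age_known = toy.dropna(subset=["Age"])  # only Age is required to be present
mostly_full = toy.dropna(thresh=3)      # keep rows with at least 3 non-null values
complete_cols = toy.dropna(axis=1)      # drop incomplete columns instead of rows

print(strict.shape, age_known.shape, mostly_full.shape, complete_cols.shape)
```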

**11. fillna()**
- Fill missing values with a specified value.

In [22]:
df['Age'].fillna(0).head()

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
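Filling ages with `0` removes the gaps but drags the average down. One common alternative, sketched here on made-up data, is to fill a numeric column with its median and a categorical column with its most frequent value. Assigning the result back to the column is used instead of `inplace=True` on a selected column, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are illustrative)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", None, "S", "C"],
})

# Numeric column: fill with the median of the known values (24.0 here)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value ("S" here)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```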


**12. unique()**
- Shows unique values in a column.

In [26]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

**13. value_counts()**
- Counts frequency of unique values.

In [27]:
df['Pclass'].value_counts()

Pclass,count
3,491
1,216
2,184
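`value_counts()` can also report proportions and include missing values. The sketch below uses a made-up `Embarked` series for illustration:

```python
import pandas as pd

# Toy series with one missing value (values are illustrative)
embarked = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

counts = embarked.value_counts()                   # missing values are skipped
shares = embarked.value_counts(normalize=True)     # proportions of non-missing values
with_missing = embarked.value_counts(dropna=False) # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```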


**14. sort_values()**
- Sort data by a column.

In [30]:
df.sort_values(by='Age', ascending=False).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
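Sorting can use several columns with a direction per column, and `na_position` controls where missing values land. A minimal sketch on made-up rows (names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (values are illustrative)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Name": ["A", "B", "C", "D"],
})

# Class ascending, then oldest first within each class; missing ages go last
ordered = toy.sort_values(
    by=["Pclass", "Age"],
    ascending=[True, False],
    na_position="last",
)
print(ordered)
```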


**15. groupby()**
- Group data and apply aggregation functions.

In [33]:
df.groupby('Pclass')['Age'].mean()

Pclass,Age
1,37.048118
2,29.866958
3,26.403259


**16. agg()**
- Multiple aggregation functions at once.

In [34]:
df.groupby('Pclass')['Age'].agg(['mean', 'max', 'min'])

Pclass,mean,max,min
1,37.048118,80.0,0.92
2,29.866958,70.0,0.67
3,26.403259,74.0,0.42
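When aggregating more than one column, named aggregation lets each output column get its own name, source column and function. The sketch below uses made-up rows, and the output names (`mean_age`, `oldest`, `median_fare`) are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Toy rows modelled on the Titanic columns (values are illustrative)
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, 35.0, 22.0, np.nan, 26.0],
    "Fare": [71.28, 53.10, 7.25, 8.05, 7.93],
})

# output_name=(source_column, function); missing ages are ignored
summary = toy.groupby("Pclass").agg(
    mean_age=("Age", "mean"),
    oldest=("Age", "max"),
    median_fare=("Fare", "median"),
)
print(summary)
```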


**17. loc[]**
- Select rows and columns by label. Label slices include the end label, so `0:5` returns six rows.

In [37]:
df.loc[0:5, ['Name', 'Age']]
# df.loc[600:605, ['Name', 'Age']]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
5,"Moran, Mr. James",29.699118


**18. iloc[]**
- Select rows and columns by integer position. Position slices exclude the end position, so `0:5` returns five rows.

In [39]:
df.iloc[0:5, 3:6]

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0
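The difference in slice endpoints between `loc[]` and `iloc[]` is easy to miss, and `loc[]` also accepts boolean conditions. A minimal sketch on made-up rows (the `toy` frame and the age threshold are assumptions for illustration):

```python
import pandas as pd

# Toy rows with the default 0..3 index (values are illustrative)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "female"],
})

by_label = toy.loc[0:2, ["Name", "Age"]]  # labels 0, 1 and 2: end label included
by_position = toy.iloc[0:2, 0:2]          # positions 0 and 1: end position excluded

# Boolean mask with loc: rows matching both conditions, one column returned
women_over_30 = toy.loc[(toy["Sex"] == "female") & (toy["Age"] > 30), "Name"]

print(len(by_label), len(by_position))
print(women_over_30.tolist())
```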


**19. drop()**
- Drop rows or columns by label.

In [40]:
df.drop(columns=['PassengerId', 'Name']).head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,female,35.0,1,0,113803,53.1,C123,S
4,0,3,male,35.0,0,0,373450,8.05,,S


**20. rename()**
- Rename columns.

In [42]:
df.rename(columns={'Pclass': 'Passenger Class'}).head()

Unnamed: 0,PassengerId,Survived,Passenger Class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**21. astype()**
- Change data type of a column.

In [48]:
# df.astype({'Age': 'float64'}).dtypes
df['Pclass'] = df['Pclass'].astype('category')
print(df['Pclass'])

0        3
1        1
2        3
3        1
4        3
      ... 
886    NaN
887    NaN
888    NaN
889    NaN
890    NaN
Name: Pclass, Length: 891, dtype: category
Categories (3, int64): [1, 2, 3]
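Passenger class has a natural order, which a plain `'category'` conversion does not record. As a sketch under that assumption, `pd.CategoricalDtype` with `ordered=True` makes comparisons and `min()`/`max()` meaningful (the `pclass` series below is made up for illustration):

```python
import pandas as pd

# Toy class values (illustrative)
pclass = pd.Series([3, 1, 3, 2], name="Pclass")

# Declare the categories explicitly and mark them as ordered: 1 < 2 < 3
ordered_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
pclass_cat = pclass.astype(ordered_type)

codes = pclass_cat.cat.codes  # integer position of each value in the category list
below_third = pclass_cat < 3  # comparisons are allowed because the type is ordered

print(pclass_cat)
print(codes.tolist(), below_third.tolist(), pclass_cat.min())
```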


**22. apply()**
- Apply a function to a column or row.

In [49]:
df['Name_length'] = df['Name'].apply(len)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name_length
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,23
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,51
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,22
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,44
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,24
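`apply()` works element by element on a column and row by row on a DataFrame with `axis=1`. The sketch below uses two names from the output above with made-up surrounding data. The helper `extract_title` and the `family_size` calculation are hypothetical examples, not part of the notebook.

```python
import pandas as pd

# Two rows modelled on the Titanic columns
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

# Element-wise on one column: apply(len) and the vectorised .str.len() agree
len_apply = toy["Name"].apply(len)
len_str = toy["Name"].str.len()

def extract_title(full_name):
    """Return the text between the comma and the first period, e.g. 'Mr'."""
    return full_name.split(",")[1].split(".")[0].strip()

titles = toy["Name"].apply(extract_title)

# Row-wise (axis=1): each call receives one row as a Series
family_size = toy.apply(lambda row: row["SibSp"] + row["Parch"] + 1, axis=1)

print(len_apply.tolist(), titles.tolist(), family_size.tolist())
```

For simple string operations the `.str` accessor is usually the more idiomatic choice, while `apply()` is useful when the logic does not fit a built-in method.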


**23. duplicated()**
- Check for duplicate rows.

In [51]:
print(df.duplicated().sum())

0


**24. drop_duplicates()**
- Remove duplicate rows.

In [53]:
df.drop_duplicates().shape

(891, 13)
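The Titanic frame has no fully duplicated rows, so the call above changes nothing. Duplicates can also be judged on selected columns with `subset`, and `keep` decides which occurrence survives. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy rows where one ticket appears three times (values are illustrative)
toy = pd.DataFrame({
    "Ticket": ["113803", "113803", "373450", "113803"],
    "Name": ["A", "B", "C", "A"],
})

full_dupes = toy.duplicated().sum()                     # rows identical in every column
ticket_dupes = toy.duplicated(subset=["Ticket"]).sum()  # repeats of the same ticket

# Keep only the last row seen for each ticket
last_per_ticket = toy.drop_duplicates(subset=["Ticket"], keep="last")

print(full_dupes, ticket_dupes)
print(last_per_ticket)
```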

**25. sample()**
- Returns a random sample of rows.

In [54]:
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name_length
806,807,0,,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,22
690,691,1,,"Dick, Mr. Albert Adrian",male,31.0,1,0,17474,57.0,B20,S,23
560,561,0,,"Morrow, Mr. Thomas Rowan",male,29.699118,0,0,372622,7.75,,Q,24
877,878,0,,"Petroff, Mr. Nedelio",male,19.0,0,0,349212,7.8958,,S,20
81,82,1,,"Sheerlinck, Mr. Jan Baptist",male,29.0,0,0,345779,9.5,,S,27
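Each run of `df.sample(5)` returns different rows. Passing `random_state` makes the sample repeatable, and `frac` samples a fraction of the rows instead of a fixed count. A small sketch on a made-up frame (the seed values are arbitrary):

```python
import pandas as pd

# Toy frame with ten rows (values are illustrative)
toy = pd.DataFrame({"PassengerId": range(1, 11)})

# The same seed gives the same rows on every call
first_draw = toy.sample(n=3, random_state=42)
second_draw = toy.sample(n=3, random_state=42)

# frac=0.5 draws half of the rows
half = toy.sample(frac=0.5, random_state=0)

print(first_draw.equals(second_draw), len(half))
```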


# Summary: Pandas Functions and Titanic Dataset

## About Pandas

Pandas is a Python library widely used for data manipulation and analysis. It provides data structures such as DataFrame and Series for handling structured data.

**Key Pandas Functions for Data Analysis**

| Function            | Purpose                                   | Example                                            |
| ------------------- | ----------------------------------------- | -------------------------------------------------- |
| `head()`            | View first few rows of the dataset        | `df.head()`                                        |
| `tail()`            | View last few rows                        | `df.tail()`                                        |
| `info()`            | Summary of data types and non-null counts | `df.info()`                                        |
| `describe()`        | Statistical summary of numerical columns  | `df.describe()`                                    |
| `shape`             | Get dataset dimensions (rows, columns)    | `df.shape`                                         |
| `columns`           | List column names                         | `df.columns`                                       |
| `dtypes`            | Show data types of columns                | `df.dtypes`                                        |
| `isnull()`          | Identify missing values                   | `df.isnull().sum()`                                |
| `dropna()`          | Remove rows with missing data             | `df.dropna()`                                      |
| `fillna()`          | Fill missing values with a specific value | `df['Age'] = df['Age'].fillna(df['Age'].mean())`   |
| `unique()`          | Find unique values in a column            | `df['Embarked'].unique()`                          |
| `value_counts()`    | Count occurrences of unique values        | `df['Pclass'].value_counts()`                      |
| `sort_values()`     | Sort rows by column values                | `df.sort_values('Age')`                            |
| `groupby()`         | Group data for aggregation                | `df.groupby('Pclass')['Age'].mean()`               |
| `agg()`             | Apply multiple aggregations               | `df.groupby('Pclass')['Age'].agg(['mean','max'])`  |
| `loc[]`             | Select rows/columns by label              | `df.loc[0:5, ['Name','Age']]`                      |
| `iloc[]`            | Select rows/columns by integer position   | `df.iloc[0:5, 0:3]`                                |
| `drop()`            | Remove rows/columns                       | `df.drop('Ticket', axis=1)`                        |
| `rename()`          | Rename columns                            | `df.rename(columns={'Pclass':'Class'})`            |
| `astype()`          | Change data type                          | `df['Pclass'] = df['Pclass'].astype('category')`   |
| `apply()`           | Apply a function to a column/row          | `df['Name_length'] = df['Name'].apply(len)`        |
| `duplicated()`      | Detect duplicate rows                     | `df.duplicated().sum()`                            |
| `drop_duplicates()` | Remove duplicates                         | `df.drop_duplicates()`                             |
| `sample()`          | Random sample of rows                     | `df.sample(5)`                                     |


## About the Titanic Dataset

The Titanic dataset is one of the most popular datasets for beginner data science and machine learning projects. It contains data about the passengers who were on the Titanic when it sank in 1912.

**Key columns:**

- PassengerId: Unique ID for each passenger

- Survived: Survival status (0 = No, 1 = Yes)

- Pclass: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)

- Name: Passenger’s name

- Sex: Gender

- Age: Age in years

- SibSp: Number of siblings/spouses aboard

- Parch: Number of parents/children aboard

- Ticket: Ticket number

- Fare: Passenger fare

- Cabin: Cabin number

- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**Why Use This Dataset?**

- Real-world data with missing values and categorical variables

- Suitable for practicing data cleaning, exploration, and basic analysis

- Commonly used for classification problems (predicting survival)