<a href="https://colab.research.google.com/github/saad-ameer/Python-for-Data-Analyst/blob/main/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas Notes – Introduction

### * What is Pandas?

* Pandas is a powerful Python library for **data analysis** and **data manipulation**
* Built on top of NumPy
* Designed to work with **tabular or labeled data**

---

### * Key Features of Pandas

* Read and write data from multiple formats (CSV, Excel, SQL, etc.)
* Handle **large datasets** efficiently
* Support **label-based slicing**, indexing, and subsetting
* Detect and handle **missing data**
* Clean and preprocess data easily
* Insert/delete rows and columns
* Align data across multiple labels
* **Reshape** and **pivot** data
* Perform **aggregations** using `groupby()`
* Merge and join different datasets
* Work with **time series** data
* Perform **rolling/window** operations

---

### * Main Data Structures in Pandas

* `Series`: One-dimensional labeled array (like a column in a table)
  * Each element has a corresponding **index**
  * Similar to NumPy arrays but with labels

* `DataFrame`: Two-dimensional labeled data structure
  * Collection of `Series` aligned in rows and columns
  * Like a table in a spreadsheet or SQL
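
A minimal sketch of how the two structures relate, assuming a tiny hand-built table (the variable names `population`, `df`, and `column` are illustrative only):

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
population = pd.Series(
    [1412600000, 1386946912], index=["China", "India"], name="Population"
)

# A DataFrame: several columns that share one row index.
df = pd.DataFrame(
    {"Region": ["Asia", "Asia"], "Population": [1412600000, 1386946912]},
    index=["China", "India"],
)

# Selecting a single column of a DataFrame gives back a Series.
column = df["Population"]
print(type(column).__name__)  # Series
print(column.loc["India"])
```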

---

### * Why Use Pandas?

* Simple, readable syntax for complex data tasks
* Efficient handling of tabular data
* Ideal for real-world, practical data analysis

---

### * Summary

* Pandas is a high-level tool for handling real-world data
* Offers intuitive and powerful tools for loading, cleaning, manipulating, analyzing, and visualizing structured data
* Next: Learn about `Series` and `DataFrame` in detail

## Pandas Notes – Creating a Series

### * What is a Series?

* A `Series` is a one-dimensional labeled array in pandas
* Similar to NumPy arrays, but each value has an associated **index**
* Can hold different data types (e.g. int, float, string)

---

### * Syntax to Create a Series

```python
pd.Series(data, index, dtype, name)
```

* `data`: array-like (list, tuple, NumPy array, dict, or scalar)
* `index`: array-like (optional; if not given, auto-generated as 0, 1, 2, ...)
* `dtype`: data type of values (optional)
* `name`: optional label for the Series
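
As a rough illustration, the four parameters can be passed together as keywords. The values and the name `sales` below are made up for the example:

```python
import pandas as pd

s = pd.Series(
    data=[100, 200, 300],   # the values
    index=["a", "b", "c"],  # one label per value
    dtype="float64",        # force floats instead of inferred integers
    name="sales",           # label attached to the Series itself
)

print(s.dtype)
print(s.name)
print(s.index.tolist())
```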

---

### * Basic Examples

```python
import pandas as pd

data = [100, 200, 300, 400, 500]
index = [0, 1, 2, 3, 4]
s = pd.Series(data, index)
```

* You can access values using index labels:
  * `s[3]` → `400`
* You can also pass data as tuple or NumPy array

---

### * Using Custom Indexes

```python
months = ['Jan', 'Feb', 'Mar', 'Apr']
revenue = (1000, 1200, 1300, 900)
rev_series = pd.Series(revenue, index=months)
```

* Access with label:
  * `rev_series['Feb']` → `1200`
* Access multiple:
  * `rev_series[['Jan', 'Mar']]`
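
Plain brackets look values up by label here. A short sketch of the more explicit accessors, `.loc` for labels and `.iloc` for positions, using the same `rev_series` data (the result variable names are illustrative):

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr']
revenue = (1000, 1200, 1300, 900)
rev_series = pd.Series(revenue, index=months)

by_label = rev_series.loc['Feb']          # lookup by index label
by_position = rev_series.iloc[1]          # second element, whatever its label
subset = rev_series.loc[['Jan', 'Mar']]   # list of labels returns a Series

print(by_label, by_position)
print(subset.tolist())
```

Spelling out `.loc` or `.iloc` avoids ambiguity when the index labels are themselves integers.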

---

### * Data Type (dtype) Handling

* Pandas will infer the data type, or you can specify it:
  * `pd.Series(data, index, dtype='float')`
  * Long double: `dtype='g'`

* If values have mixed types, dtype becomes `object`
* Forcing a numeric dtype on values that cannot be converted (for example a non-numeric string with `dtype='float'`) raises an error
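
The dtype rules above can be checked with a small sketch. The sample values are arbitrary, and the flag `conversion_failed` is just a helper for this example:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                        # integer dtype is inferred
mixed = pd.Series([1, 'two', 3.0])                 # mixed values fall back to object
as_float = pd.Series([1, 2, 3], dtype='float64')   # explicit dtype

# Forcing a numeric dtype on non-numeric text fails.
try:
    pd.Series([1, 'two', 3.0], dtype='float64')
    conversion_failed = False
except (ValueError, TypeError):
    conversion_failed = True

print(ints.dtype, mixed.dtype, as_float.dtype)
print(conversion_failed)
```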

---

### * Auto Index and Empty Series

* If `index` not specified → auto index (0, 1, 2, ...)
* If only `index` given, data is filled with `NaN`

```python
pd.Series(index=['a', 'b', 'c'])
```

---

### * Creating Series from Dictionary

```python
d = {'a': 10, 'b': 20, 'c': 30}
pd.Series(d)
```

* Dictionary keys become index
* Dictionary values become data
* You can also specify the index explicitly to reorder or subset:
  * `pd.Series(d, index=['b', 'a'])`
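
A small sketch of what happens when the explicit index contains a label that is not a key in the dictionary. The extra label `'z'` is an assumption added for illustration:

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# 'b' and 'a' are reordered, 'c' is left out, 'z' has no matching key.
s = pd.Series(d, index=['b', 'a', 'z'])

print(s.index.tolist())
print(s.isna().tolist())
```

Labels without a matching key are kept and filled with `NaN`, which also turns the integer values into floats.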

---

### * Summary

* Series are like labeled NumPy arrays
* They support various data types and allow powerful indexing
* Can be created from lists, tuples, arrays, dictionaries, or scalars

In [346]:
import pandas as pd

In [347]:
index_list = [0,1,2,3,4,5]

In [348]:
data_list = [100,200,300,400,500,600]

In [349]:
series_1 = pd.Series(data=data_list, index=index_list)

In [350]:
series_1

Unnamed: 0,0
0,100
1,200
2,300
3,400
4,500
5,600


In [351]:
series_1[4]

np.int64(500)

In [352]:
series_1[[4]]

Unnamed: 0,0
4,500


In [353]:
month_index_2025 = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

In [354]:
revenue_data_2025 = (1000000,2000000,3000000,4000000,5000000,6000000,7000000,8000000,9000000,10000000,11000000,12000000)

In [355]:
revenue_series = pd.Series(data=revenue_data_2025, index=month_index_2025)

In [356]:
revenue_series

Unnamed: 0,0
Jan,1000000
Feb,2000000
Mar,3000000
Apr,4000000
May,5000000
Jun,6000000
Jul,7000000
Aug,8000000
Sep,9000000
Oct,10000000
Nov,11000000
Dec,12000000


In [357]:
type(revenue_series)

In [358]:
revenue_series['Jul']

np.int64(7000000)

In [359]:
revenue_series[['Jul']]

Unnamed: 0,0
Jul,7000000


In [360]:
revenue_series[['Jul','Aug','Sep']]

Unnamed: 0,0
Jul,7000000
Aug,8000000
Sep,9000000


In [361]:
revenue_series = pd.Series(data=revenue_data_2025, index=month_index_2025,dtype='float')

In [362]:
revenue_series

Unnamed: 0,0
Jan,1000000.0
Feb,2000000.0
Mar,3000000.0
Apr,4000000.0
May,5000000.0
Jun,6000000.0
Jul,7000000.0
Aug,8000000.0
Sep,9000000.0
Oct,10000000.0
Nov,11000000.0
Dec,12000000.0


In [363]:
dict_1={'key1':'value1','key2':'value2','key3':'value3','key4':'value4'}

In [364]:
pd.Series(data=dict_1,index=dict_1)

Unnamed: 0,0
key1,value1
key2,value2
key3,value3
key4,value4


## Pandas Notes – DataFrames

### * What is a DataFrame?

* A DataFrame is a 2D labeled data structure in pandas (rows and columns)
* Think of it like a table or spreadsheet
* Internally, it's a collection of Series sharing the same index

---

### * Creating a DataFrame

```python
import pandas as pd

data = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
    [100, 110, 120],
    [130, 140, 150]
]
columns = ['A', 'B', 'C']
index = [0, 1, 2, 3, 4]

df = pd.DataFrame(data, columns=columns, index=index)
```

* `data`: 2D array-like structure (lists, tuples, NumPy arrays)
* `columns`: list of column names (length must match number of columns)
* `index`: list of row labels (length must match number of rows)
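
Another common input is a dictionary that maps column names to lists of values. A minimal sketch, where the row labels `'x'`, `'y'`, `'z'` are hypothetical:

```python
import pandas as pd

data = {
    'A': [10, 40, 70],
    'B': [20, 50, 80],
    'C': [30, 60, 90],
}

# Keys become column names; every list must have the same length.
df = pd.DataFrame(data, index=['x', 'y', 'z'])

print(df.shape)
print(df.loc['y', 'B'])
```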

---

### * Indexing with Columns

```python
df.set_index('C')
```

* Sets column `C` as the new index (drops `C` by default)
* To **keep** the column: `df.set_index('C', drop=False)`
* To make the change **in-place**: `df.set_index('C', inplace=True)`
* To **append** to existing index: `df.set_index('C', append=True)`

---

### * Resetting the Index

```python
df.reset_index()
```

* Resets the index to default (0, 1, 2, …) and moves the old index back into a regular column
* To **discard** the old index instead of keeping it as a column: `df.reset_index(drop=True)`
* To make the reset **in-place**: `df.reset_index(inplace=True)`
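
A short round-trip sketch, assuming a small frame with columns `A`, `B`, `C`. Without `inplace=True`, both methods return a new DataFrame and leave the original untouched:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 5, 1], 'B': [4, 3, 3], 'C': [1, 1, 7]})

indexed = df.set_index('A')                 # 'A' becomes the row index
restored = indexed.reset_index()            # 'A' comes back as a column
dropped = indexed.reset_index(drop=True)    # old index is thrown away

print(list(df.columns))        # original is unchanged
print(list(indexed.columns))
print(list(restored.columns))
print(list(dropped.columns))
```

Assigning the result to a new name, as done here, is an alternative to `inplace=True` that keeps each intermediate step available.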

---

### * Creating DataFrame from Sets, Tuples

```python
data = [
    {1, 2, 3},
    (4, 5, 6),
    [7, 8, 9],
    [10, 11, 12],
    [13, 14, 15]
]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
```

* A list may mix lists, tuples, and sets as rows as long as every row has the same length; sets are unordered, so the order of their values across columns is not guaranteed

---

### * Reading CSV into DataFrame

```python
df = pd.read_csv('filename.csv')
```

* `pd.read_csv()` reads a CSV file into a DataFrame
* If CSV has no header row:
  * `pd.read_csv('file.csv', header=None)`
* If the file is in another folder, use full path:
  * `pd.read_csv('/Users/Name/Desktop/file.csv')`
* Common arguments:
  * `sep=','` → specify separator (default is comma)
  * `index_col=0` → use first column as index
  * `usecols=[0, 2]` or `usecols=['Country', 'Population']`
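
The arguments above can be tried without a file on disk by reading from an in-memory string. This is only a sketch: `csv_text` is a made-up three-row sample, and `io.StringIO` stands in for a real file path:

```python
import io
import pandas as pd

csv_text = (
    "Rank,Country,Region,Population\n"
    "1,China,Asia,1412600000\n"
    "2,India,Asia,1386946912\n"
    "3,United States,Americas,333073186\n"
)

full = pd.read_csv(io.StringIO(csv_text))
by_rank = pd.read_csv(io.StringIO(csv_text), index_col=0)
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Country', 'Population'])
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(full.shape)
print(by_rank.index.name)
print(list(subset.columns))
print(no_header.shape)   # the header line is now treated as a data row
```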

---

### * Summary

* DataFrames are 2D labeled tables, great for structured data
* Can be created from nested lists or tuples, NumPy arrays, dictionaries, or CSV files
* `set_index()` and `reset_index()` let you control row labels
* Use `pd.read_csv()` to import data files as DataFrames

In [365]:
import pandas as pd

In [366]:
row_labels = ['a','b','c','d','e']
column_labels = ['A','B','C']
data = [[2,4,1],[5,3,1],[7,3,7],[2,6,9],[1,6,8]]

In [367]:
data = [[2,4,1],(5,3,1),{7,3,1},[2,6,9],[1,6,8]]

In [368]:
df =pd.DataFrame(data=data,index=row_labels,columns=column_labels)

In [369]:
df

Unnamed: 0,A,B,C
a,2,4,1
b,5,3,1
c,1,3,7
d,2,6,9
e,1,6,8


In [370]:
type(df)

In [371]:
df.set_index('A')

A,B,C
2,4,1
5,3,1
1,3,7
2,6,9
1,6,8


In [372]:
df.set_index('A',drop=False)

A,A,B,C
2,2,4,1
5,5,3,1
1,1,3,7
2,2,6,9
1,1,6,8


In [373]:
df

Unnamed: 0,A,B,C
a,2,4,1
b,5,3,1
c,1,3,7
d,2,6,9
e,1,6,8


In [374]:
df.set_index('A',inplace=True)

In [375]:
df

A,B,C
2,4,1
5,3,1
1,3,7
2,6,9
1,6,8


In [376]:
df.set_index('B',drop=False,append=True)

A,B,B,C
2,4,4,1
5,3,3,1
1,3,3,7
2,6,6,9
1,6,6,8


In [377]:
df.reset_index(inplace=True)

In [378]:
df

Unnamed: 0,A,B,C
0,2,4,1
1,5,3,1
2,1,3,7
3,2,6,9
4,1,6,8


In [379]:
countries = pd.read_csv(filepath_or_buffer='top_10_countries.csv')

In [380]:
countries

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [381]:
countries = pd.read_csv('top_10_countries.csv')

In [382]:
countries

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [383]:
type(countries)

In [384]:
countries2 = pd.read_csv('top_10_countries_no_header.csv', header=None)

In [385]:
countries2

Unnamed: 0,0,1,2,3,4,5
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia[b],Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia[b],Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


## Pandas Notes – Selecting and Filtering DataFrames

### * Selecting Columns

```python
# Single column (returns a Series)
df['Region']

# Multiple columns (returns a DataFrame)
df[['Region', 'Country / Dependency', 'Rank']]
```

* Use single square brackets for one column  
* Use list inside double brackets for multiple columns  
* Columns are strings (quoted)

---

### * Selecting Rows with `.iloc` (Position-based)

```python
# Row at index 3
df.iloc[3]

# Rows from index 3 onwards
df.iloc[3:]

# Reverse all rows
df.iloc[::-1]

# Single value: population of row index 2, column index 3
df.iloc[2, 3]

# Rows from position 2 onward, columns at positions 1 to 3 (the stop position 4 is excluded)
df.iloc[2:, 1:4]
```

* `.iloc` uses **integer positions**  
* Format: `df.iloc[rows, columns]`  
* Use colon `:` to slice

---

### * Selecting Rows/Columns with `.loc` (Label-based)

```python
# Row with label 2, column with label 'Country / Dependency'
df.loc[2, 'Country / Dependency']

# All rows, from column 'Country / Dependency' onwards
df.loc[:, 'Country / Dependency':]

# Select multiple named columns
df.loc[:, ['Population', 'Country / Dependency']]
```

* `.loc` uses **label names** (not integer positions); label slices include both endpoints  
* Format: `df.loc[rows, columns]`
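
With the default 0, 1, 2, … index, positions and labels look identical, which hides how differently the two accessors slice. A minimal sketch on a four-row sample taken from the countries table (column names shortened for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Country': ['China', 'India', 'United States', 'Indonesia'],
        'Region': ['Asia', 'Asia', 'Americas', 'Asia'],
        'Population': [1412600000, 1386946912, 333073186, 271350000],
    }
)

by_position = df.iloc[1:3]     # positions 1 and 2; stop position is excluded
by_label = df.loc[1:3]         # labels 1, 2 and 3; stop label is included
cell = df.loc[2, 'Country']    # single value by row label and column name
cols = df.loc[:, 'Region':]    # label slice over columns

print(len(by_position), len(by_label))
print(cell)
print(list(cols.columns))
```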

---

### * Filtering Rows with Conditions

```python
# Where Region == 'Asia'
df[df['Region'] == 'Asia']

# Where Region == 'Asia' AND Population > 300_000_000
df[(df['Region'] == 'Asia') & (df['Population'] > 300_000_000)]

# Where Region == 'Asia' OR Population > 300_000_000
df[(df['Region'] == 'Asia') | (df['Population'] > 300_000_000)]
```

* Use `&` for AND and `|` for OR  
* Each condition must be in parentheses
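
A sketch of the same filters with each condition stored as a named boolean Series first. The names `is_asia` and `is_large` are illustrative, and the five rows are a sample of the countries table:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Country': ['China', 'India', 'United States', 'Indonesia', 'Brazil'],
        'Region': ['Asia', 'Asia', 'Americas', 'Asia', 'Americas'],
        'Population': [1412600000, 1386946912, 333073186, 271350000, 214231641],
    }
)

is_asia = df['Region'] == 'Asia'              # one True/False per row
is_large = df['Population'] > 300_000_000

both = df[is_asia & is_large]      # rows where both conditions hold
either = df[is_asia | is_large]    # rows where at least one holds

print(both['Country'].tolist())
print(either['Country'].tolist())
```

The parentheses are needed in the inline form because `&` and `|` bind more tightly than `==` and `>` in Python. Naming the masks sidesteps that.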

---

### * Filtering Rows + Selecting Columns

```python
# Return Rank and Country/Dependency for filtered data
df[(df['Region'] == 'Asia') & (df['Population'] > 300_000_000)][['Rank', 'Country / Dependency']]
```
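
An alternative to chaining two sets of brackets is to pass the row mask and the column list to `.loc` in one step. A minimal sketch, assuming the same column names as the countries table and a four-row sample:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Rank': [1, 2, 3, 4],
        'Country / Dependency': ['China', 'India', 'United States', 'Indonesia'],
        'Region': ['Asia', 'Asia', 'Americas', 'Asia'],
        'Population': [1412600000, 1386946912, 333073186, 271350000],
    }
)

mask = (df['Region'] == 'Asia') & (df['Population'] > 300_000_000)

# Rows chosen by the mask, columns chosen by name, in a single selection.
result = df.loc[mask, ['Rank', 'Country / Dependency']]

print(result['Country / Dependency'].tolist())
```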

---

### * Quick Methods for Data Inspection

```python
df.head()  # First 5 rows
df.tail()  # Last 5 rows
```

* `head(n)` and `tail(n)` can take custom number `n`  
* Useful for previewing large datasets

---

### * Summary

* Use `df[col]` or `df[[col1, col2]]` for column selection  
* Use `.iloc[]` for position-based slicing, `.loc[]` for label-based slicing  
* Filter rows using boolean conditions inside brackets  
* Combine filters with `&` (and), `|` (or)  
* Use `head()` and `tail()` to inspect DataFrames efficiently

In [386]:
import pandas as pd

In [387]:
countries_data = pd.read_csv('top_10_countries.csv')

In [388]:
countries_data

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [389]:
countries_data['Region']

Unnamed: 0,Region
0,Asia
1,Asia
2,Americas
3,Asia
4,Asia
5,Americas
6,Africa
7,Asia
8,Europe
9,Americas


In [390]:
type(countries_data['Region'])

In [391]:
countries_data[['Region','Country / Dependency','Population']]

Unnamed: 0,Region,Country / Dependency,Population
0,Asia,China,1412600000
1,Asia,India,1386946912
2,Americas,United States,333073186
3,Asia,Indonesia,271350000
4,Asia,Pakistan,225200000
5,Americas,Brazil,214231641
6,Africa,Nigeria,211401000
7,Asia,Bangladesh,172062576
8,Europe,Russia,146171015
9,Americas,Mexico,126014024


In [392]:
countries_data.iloc[3]

Unnamed: 0,3
Rank,4
Country / Dependency,Indonesia
Region,Asia
Population,271350000
% of world,3.42%
Date,31-Dec-20


In [393]:
countries_data.iloc[4:]

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [394]:
countries_data.iloc[::-1]

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
2,3,United States,Americas,333073186,4.20%,18-Jan-22
1,2,India,Asia,1386946912,17.50%,18-Jan-22
0,1,China,Asia,1412600000,17.80%,31-Dec-21


In [395]:
countries_data.iloc[0:5:2]

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
2,3,United States,Americas,333073186,4.20%,18-Jan-22
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21


In [396]:
countries_data.iloc[2,3]

np.int64(333073186)

In [397]:
countries_data.iloc[3:,1:4]

Unnamed: 0,Country / Dependency,Region,Population
3,Indonesia,Asia,271350000
4,Pakistan,Asia,225200000
5,Brazil,Americas,214231641
6,Nigeria,Africa,211401000
7,Bangladesh,Asia,172062576
8,Russia,Europe,146171015
9,Mexico,Americas,126014024


In [398]:
countries_data.loc[5,'Country / Dependency']

'\xa0Brazil'

In [399]:
countries_data.loc[3:,['Country / Dependency','Population']]

Unnamed: 0,Country / Dependency,Population
3,Indonesia,271350000
4,Pakistan,225200000
5,Brazil,214231641
6,Nigeria,211401000
7,Bangladesh,172062576
8,Russia,146171015
9,Mexico,126014024


In [400]:
countries_data=='Asia'

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,False,False,True,False,False,False
1,False,False,True,False,False,False
2,False,False,False,False,False,False
3,False,False,True,False,False,False
4,False,False,True,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,True,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [401]:
countries_data[countries_data=='Asia']

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,,,Asia,,,
1,,,Asia,,,
2,,,,,,
3,,,Asia,,,
4,,,Asia,,,
5,,,,,,
6,,,,,,
7,,,Asia,,,
8,,,,,,
9,,,,,,


In [402]:
countries_data['Region']=='Asia'

Unnamed: 0,Region
0,True
1,True
2,False
3,True
4,True
5,False
6,False
7,True
8,False
9,False


In [403]:
countries_data[countries_data['Region']=='Asia']

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22


In [404]:
countries_data[
    (countries_data['Region'] == 'Asia') |
    (countries_data['Population'].astype(float) > 300000000)
]

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22


In [405]:
countries_data[
    (countries_data['Region']=='Asia') &
    (countries_data['Population'].astype(float)>300000000)
][['Region','Country / Dependency']]

Unnamed: 0,Region,Country / Dependency
0,Asia,China
1,Asia,India


In [406]:
countries_data.head()

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21


In [407]:
countries_data.tail()

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [408]:
countries_data.head(3)

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22


## Pandas Notes – Conditional Selection with `isin()`

### * Select rows where a column matches any value from a list

```python
# Example: Select rows where Region is either 'Asia' or 'Americas'
df[df['Region'].isin(['Asia', 'Americas'])]
```

* `isin()` returns a Boolean Series  
* Can be used inside `df[...]` to filter rows  
* Useful for matching multiple values in a single column

---

### * Select rows where a column **does not** match values in a list

```python
# Example: Select rows where Region is NOT 'Asia' or 'Americas'
df[~df['Region'].isin(['Asia', 'Americas'])]
```

* Use `~` (bitwise NOT) to invert the condition  
* Returns rows where condition is **not** met
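
A small sketch showing that the mask and its inverse split the rows into two complementary groups. The list name `wanted` and the four sample rows are chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Country': ['China', 'United States', 'Nigeria', 'Russia'],
        'Region': ['Asia', 'Americas', 'Africa', 'Europe'],
    }
)

wanted = ['Asia', 'Europe']
mask = df['Region'].isin(wanted)   # boolean Series, one entry per row

matching = df[mask]     # Region is in the list
others = df[~mask]      # Region is not in the list

print(matching['Country'].tolist())
print(others['Country'].tolist())
```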

---

### * Summary

* `df['col'].isin([val1, val2, ...])` → matches any of the given values  
* `~df['col'].isin([...])` → matches none of the given values  
* Works well for filtering multiple categorical values

In [409]:
countries_data['Region'].isin(['Asia','Europe'])

Unnamed: 0,Region
0,True
1,True
2,False
3,True
4,True
5,False
6,False
7,True
8,True
9,False


In [410]:
countries_data[countries_data['Region'].isin(['Asia','Europe'])]

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21


In [411]:
countries_data[~countries_data['Region'].isin(['Asia','Europe'])]

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
2,3,United States,Americas,333073186,4.20%,18-Jan-22
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


## Pandas Notes – Data Manipulation Techniques

### Basic Inspection

- `df.shape` → tuple of `(rows, columns)`
- `df.columns` → `Index` of column labels
- `df.index` → row labels (a `RangeIndex` by default)

---

### Rename Columns

```python
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```

- Only specify columns you want to rename
- Use `inplace=True` to apply changes directly

---

### Drop Columns

```python
# Drop single or multiple columns
df.drop('col_name', axis=1, inplace=True)
df.drop(['col1', 'col2'], axis=1, inplace=True)
```

- Use `axis=1` for columns
- `inplace=True` to modify DataFrame directly
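
Both `rename()` and `drop()` also work without `inplace=True`, returning a new DataFrame that can be assigned to a name. A minimal sketch on a two-row sample (`renamed` and `trimmed` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Country / Dependency': ['China', 'India'],
        'Population': [1412600000, 1386946912],
        'Date': ['31-Dec-21', '18-Jan-22'],
    }
)

renamed = df.rename(columns={'Country / Dependency': 'Country'})
trimmed = renamed.drop(columns='Date')   # same effect as drop('Date', axis=1)

print(list(trimmed.columns))
print(list(df.columns))   # the original frame keeps all three columns
```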

---

### Arithmetic on Columns

- You can perform arithmetic directly on numerical columns or with scalars

```python
df['population_millions'] = round(df['Population'] / 1_000_000, 2)
df['country_region'] = df['Country'] + ' - ' + df['Region']
```

- Arithmetic only works on compatible data types

---

### TypeError Example

```python
df['Country'] + df['Population']  # will throw error (string + int)
```
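
One way around the error, sketched here on a two-row sample, is to convert the numeric column to strings before concatenating. The `raised` flag and the `label` column content are just for the example:

```python
import pandas as pd

df = pd.DataFrame(
    {'Country': ['China', 'India'], 'Population': [1412600000, 1386946912]}
)

# Adding text to integers is rejected.
try:
    df['Country'] + df['Population']
    raised = False
except TypeError:
    raised = True

# Converting the numbers to strings first makes the types compatible.
label = df['Country'] + ': ' + df['Population'].astype(str)

print(raised)
print(label.tolist())
```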

---

### Check Data Types

```python
df.dtypes
```

- Use `.astype()` to change data type

---

### Clean and Convert Object to Numeric

Problem: Percentage column stored as string with `%`  
Solution:

1. Remove `%` symbol
2. Convert to float

```python
# Using .apply() with custom function
def remove_percent(x):
    return x[:-1]

df['% of world'] = df['% of world'].apply(remove_percent)

# Or using lambda function
df['% of world'] = df['% of world'].apply(lambda x: x[:-1])
```

```python
# Convert to float
df['% of world'] = df['% of world'].astype(float)
```

---

### Calculate New Column

```python
# Compute world population
df['world_population'] = df['Population'] / (df['% of world'] / 100)
```

You can also break it into steps:

```python
x = df['Population']
y = df['% of world'] / 100
df['world_population'] = x / y
```
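
The clean-up and the calculation can also be sketched with pandas' vectorized string methods instead of `.apply()`. This is an alternative, not the approach used in the notes: `str.rstrip('%')` only removes trailing `%` characters, and `pd.to_numeric` does the conversion. The two rows are a sample of the countries table:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Country': ['China', 'India'],
        'Population': [1412600000, 1386946912],
        '% of world': ['17.80%', '17.50%'],
    }
)

# Strip the trailing '%' from every value, then convert the text to numbers.
df['% of world'] = pd.to_numeric(df['% of world'].str.rstrip('%'))

# Population divided by its share (as a fraction) estimates the world total.
df['world_population'] = df['Population'] / (df['% of world'] / 100)

print(df['% of world'].tolist())
print(df['world_population'].round().tolist())
```

Unlike `x[:-1]`, `rstrip('%')` leaves a value unchanged if it has no `%` at the end, so running the cell twice does not chop off digits.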

---

### Summary

- Use `.rename()`, `.drop()`, and column assignments to manipulate data
- Use `.apply()` or `lambda` to transform column values
- Use `.astype()` for type conversion
- Combine or derive new columns using arithmetic

```python
df['new_col'] = df['col1'] + df['col2']  # or any valid operation
```

---

In [412]:
import pandas as pd

In [413]:
countries_df = pd.read_csv('top_10_countries.csv')

In [414]:
countries_df.shape

(10, 6)

In [415]:
countries_df.columns

Index(['Rank', 'Country / Dependency', 'Region', 'Population', '% of world',
       'Date'],
      dtype='object')

In [416]:
countries_df.index

RangeIndex(start=0, stop=10, step=1)

In [417]:
countries_df.rename(columns={'Country / Dependency' : 'Country'}, inplace=True)

In [418]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [419]:
#countries_df.drop(labels='Date', axis=1, inplace=True)

In [420]:
countries_df.drop(columns='Date',inplace=True)

In [421]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world
0,1,China,Asia,1412600000,17.80%
1,2,India,Asia,1386946912,17.50%
2,3,United States,Americas,333073186,4.20%
3,4,Indonesia,Asia,271350000,3.42%
4,5,Pakistan,Asia,225200000,2.84%
5,6,Brazil,Americas,214231641,2.70%
6,7,Nigeria,Africa,211401000,2.67%
7,8,Bangladesh,Asia,172062576,2.17%
8,9,Russia,Europe,146171015,1.84%
9,10,Mexico,Americas,126014024,1.59%


In [422]:
countries_df['Population'].astype('float')/10000000

Unnamed: 0,Population
0,141.26
1,138.694691
2,33.307319
3,27.135
4,22.52
5,21.423164
6,21.1401
7,17.206258
8,14.617102
9,12.601402


In [423]:
#round(countries_df['Population'].str.replace(',','').astype(float)/1000000,2)
round(countries_df['Population'].astype(float)/1000000,2)

Unnamed: 0,Population
0,1412.6
1,1386.95
2,333.07
3,271.35
4,225.2
5,214.23
6,211.4
7,172.06
8,146.17
9,126.01


In [424]:
#countries_df['Population (millions)'] = round(countries_df['Population'].str.replace(',','').astype(float)/1000000,2)
countries_df['Population (millions)'] = round(countries_df['Population'].astype(float)/1000000,2)

In [425]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Population (millions)
0,1,China,Asia,1412600000,17.80%,1412.6
1,2,India,Asia,1386946912,17.50%,1386.95
2,3,United States,Americas,333073186,4.20%,333.07
3,4,Indonesia,Asia,271350000,3.42%,271.35
4,5,Pakistan,Asia,225200000,2.84%,225.2
5,6,Brazil,Americas,214231641,2.70%,214.23
6,7,Nigeria,Africa,211401000,2.67%,211.4
7,8,Bangladesh,Asia,172062576,2.17%,172.06
8,9,Russia,Europe,146171015,1.84%,146.17
9,10,Mexico,Americas,126014024,1.59%,126.01


In [426]:
countries_df['Country / Region'] = countries_df['Country'] + ' / ' + countries_df['Region']

In [427]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Population (millions),Country / Region
0,1,China,Asia,1412600000,17.80%,1412.6,China / Asia
1,2,India,Asia,1386946912,17.50%,1386.95,India / Asia
2,3,United States,Americas,333073186,4.20%,333.07,United States / Americas
3,4,Indonesia,Asia,271350000,3.42%,271.35,Indonesia / Asia
4,5,Pakistan,Asia,225200000,2.84%,225.2,Pakistan / Asia
5,6,Brazil,Americas,214231641,2.70%,214.23,Brazil / Americas
6,7,Nigeria,Africa,211401000,2.67%,211.4,Nigeria / Africa
7,8,Bangladesh,Asia,172062576,2.17%,172.06,Bangladesh / Asia
8,9,Russia,Europe,146171015,1.84%,146.17,Russia / Europe
9,10,Mexico,Americas,126014024,1.59%,126.01,Mexico / Americas


In [428]:
countries_df['Country'] + ' / ' + countries_df['Region']

Unnamed: 0,0
0,China / Asia
1,India / Asia
2,United States / Americas
3,Indonesia / Asia
4,Pakistan / Asia
5,Brazil / Americas
6,Nigeria / Africa
7,Bangladesh / Asia
8,Russia / Europe
9,Mexico / Americas


In [429]:
countries_df.dtypes

Unnamed: 0,0
Rank,int64
Country,object
Region,object
Population,int64
% of world,object
Population (millions),float64
Country / Region,object


In [430]:
countries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Rank                   10 non-null     int64  
 1   Country                10 non-null     object 
 2   Region                 10 non-null     object 
 3   Population             10 non-null     int64  
 4   % of world             10 non-null     object 
 5   Population (millions)  10 non-null     float64
 6   Country / Region       10 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 692.0+ bytes


In [431]:
countries_df['% of world']

Unnamed: 0,% of world
0,17.80%
1,17.50%
2,4.20%
3,3.42%
4,2.84%
5,2.70%
6,2.67%
7,2.17%
8,1.84%
9,1.59%


In [433]:
'17.80%'[:-1]

'17.80'

In [434]:
'17.80%'[:len('17.80%')-1]

'17.80'

In [435]:
def remove_percent(x):
  return x[:-1]

In [436]:
remove_percent('17.80%')

'17.80'

In [437]:
countries_df['% of world'].apply(remove_percent)

Unnamed: 0,% of world
0,17.8
1,17.5
2,4.2
3,3.42
4,2.84
5,2.7
6,2.67
7,2.17
8,1.84
9,1.59


In [438]:
countries_df['% of world'].apply(lambda x: x[:-1])

Unnamed: 0,% of world
0,17.8
1,17.5
2,4.2
3,3.42
4,2.84
5,2.7
6,2.67
7,2.17
8,1.84
9,1.59


In [439]:
countries_df['% of world'] = countries_df['% of world'].apply(lambda x: x[:-1])

In [440]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Population (millions),Country / Region
0,1,China,Asia,1412600000,17.8,1412.6,China / Asia
1,2,India,Asia,1386946912,17.5,1386.95,India / Asia
2,3,United States,Americas,333073186,4.2,333.07,United States / Americas
3,4,Indonesia,Asia,271350000,3.42,271.35,Indonesia / Asia
4,5,Pakistan,Asia,225200000,2.84,225.2,Pakistan / Asia
5,6,Brazil,Americas,214231641,2.7,214.23,Brazil / Americas
6,7,Nigeria,Africa,211401000,2.67,211.4,Nigeria / Africa
7,8,Bangladesh,Asia,172062576,2.17,172.06,Bangladesh / Asia
8,9,Russia,Europe,146171015,1.84,146.17,Russia / Europe
9,10,Mexico,Americas,126014024,1.59,126.01,Mexico / Americas


In [441]:
countries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Rank                   10 non-null     int64  
 1   Country                10 non-null     object 
 2   Region                 10 non-null     object 
 3   Population             10 non-null     int64  
 4   % of world             10 non-null     object 
 5   Population (millions)  10 non-null     float64
 6   Country / Region       10 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 692.0+ bytes


In [442]:
countries_df.dtypes

Unnamed: 0,0
Rank,int64
Country,object
Region,object
Population,int64
% of world,object
Population (millions),float64
Country / Region,object


In [443]:
countries_df['% of world'].astype(float)

Unnamed: 0,% of world
0,17.8
1,17.5
2,4.2
3,3.42
4,2.84
5,2.7
6,2.67
7,2.17
8,1.84
9,1.59


In [444]:
countries_df['% of world'] = countries_df['% of world'].astype(float)

In [445]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Population (millions),Country / Region
0,1,China,Asia,1412600000,17.8,1412.6,China / Asia
1,2,India,Asia,1386946912,17.5,1386.95,India / Asia
2,3,United States,Americas,333073186,4.2,333.07,United States / Americas
3,4,Indonesia,Asia,271350000,3.42,271.35,Indonesia / Asia
4,5,Pakistan,Asia,225200000,2.84,225.2,Pakistan / Asia
5,6,Brazil,Americas,214231641,2.7,214.23,Brazil / Americas
6,7,Nigeria,Africa,211401000,2.67,211.4,Nigeria / Africa
7,8,Bangladesh,Asia,172062576,2.17,172.06,Bangladesh / Asia
8,9,Russia,Europe,146171015,1.84,146.17,Russia / Europe
9,10,Mexico,Americas,126014024,1.59,126.01,Mexico / Americas


In [446]:
countries_df.dtypes

Unnamed: 0,0
Rank,int64
Country,object
Region,object
Population,int64
% of world,float64
Population (millions),float64
Country / Region,object


In [447]:
#countries_df['Population'] = countries_df['Population'].str.replace(',','').astype(float)
countries_df['Population'] = countries_df['Population'].astype(float)

In [448]:
countries_df['World Population'] = countries_df['Population']/(countries_df['% of world']/100)

In [449]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Population (millions),Country / Region,World Population
0,1,China,Asia,1412600000.0,17.8,1412.6,China / Asia,7935955000.0
1,2,India,Asia,1386947000.0,17.5,1386.95,India / Asia,7925411000.0
2,3,United States,Americas,333073200.0,4.2,333.07,United States / Americas,7930314000.0
3,4,Indonesia,Asia,271350000.0,3.42,271.35,Indonesia / Asia,7934211000.0
4,5,Pakistan,Asia,225200000.0,2.84,225.2,Pakistan / Asia,7929577000.0
5,6,Brazil,Americas,214231600.0,2.7,214.23,Brazil / Americas,7934505000.0
6,7,Nigeria,Africa,211401000.0,2.67,211.4,Nigeria / Africa,7917640000.0
7,8,Bangladesh,Asia,172062600.0,2.17,172.06,Bangladesh / Asia,7929151000.0
8,9,Russia,Europe,146171000.0,1.84,146.17,Russia / Europe,7944077000.0
9,10,Mexico,Americas,126014000.0,1.59,126.01,Mexico / Americas,7925410000.0


In [450]:
x = countries_df['Population']

In [451]:
y = countries_df['% of world']/100

In [452]:
countries_df['World Population'] = x/y

In [453]:
countries_df

Unnamed: 0,Rank,Country,Region,Population,% of world,Population (millions),Country / Region,World Population
0,1,China,Asia,1412600000.0,17.8,1412.6,China / Asia,7935955000.0
1,2,India,Asia,1386947000.0,17.5,1386.95,India / Asia,7925411000.0
2,3,United States,Americas,333073200.0,4.2,333.07,United States / Americas,7930314000.0
3,4,Indonesia,Asia,271350000.0,3.42,271.35,Indonesia / Asia,7934211000.0
4,5,Pakistan,Asia,225200000.0,2.84,225.2,Pakistan / Asia,7929577000.0
5,6,Brazil,Americas,214231600.0,2.7,214.23,Brazil / Americas,7934505000.0
6,7,Nigeria,Africa,211401000.0,2.67,211.4,Nigeria / Africa,7917640000.0
7,8,Bangladesh,Asia,172062600.0,2.17,172.06,Bangladesh / Asia,7929151000.0
8,9,Russia,Europe,146171000.0,1.84,146.17,Russia / Europe,7944077000.0
9,10,Mexico,Americas,126014000.0,1.59,126.01,Mexico / Americas,7925410000.0


## Pandas Notes – More Data Manipulation Techniques

### Load Data

```python
import pandas as pd
df = pd.read_csv('TfL_daily_cycle.csv')  # Load CSV
```

---

### Clean Up Unwanted Columns

```python
df.drop('Unnamed: 2', axis=1, inplace=True)
```

---

### Convert to Datetime

#### Method 1: Using `astype`

```python
df['Date'] = df['Date'].astype('datetime64[ns]')  # include the unit; day-first strings may be misread
```

#### Method 2: Using `pd.to_datetime` with format codes (more reliable)

```python
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
```

- `%d` = Day (2-digit)
- `%m` = Month (2-digit)
- `%Y` = Year (4-digit)
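
With an explicit format, any string that does not match raises an error by default. A minimal sketch, assuming a small made-up Series with one malformed entry, of using `errors='coerce'` so bad values become `NaT` and can be inspected afterwards:

```python
import pandas as pd

# Made-up values: two valid day-first dates and one malformed entry
days = pd.Series(['30/07/2010', '31/07/2010', 'not a date'])

# errors='coerce' converts unparseable strings to NaT instead of raising
parsed = pd.to_datetime(days, format='%d/%m/%Y', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # how many entries failed to parse
```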

---

### Sort Values

```python
df.sort_values(by='Total Cycle Hire', ascending=False, inplace=True)
```

- Sorts rows by the values in the named column
- `ascending=False` puts the largest values first; omit it (or pass `True`) for ascending order

---

### View Top or Bottom Rows

```python
df.head()  # Top 5 rows
df.tail()  # Bottom 5 rows
```

---

### Extract Components from Dates

```python
# Series.dt is a pandas accessor, so no datetime import is needed

# Extract month-year from Date
df['Month/Year'] = df['Date'].dt.strftime('%m-%y')  # e.g., 07-21
df['Month/Year'] = df['Date'].dt.strftime('%b-%y')  # e.g., Jul-21
```
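
`strftime` returns text, so labels such as `Jul-21` sort alphabetically rather than chronologically. A hedged sketch of numeric and period-based alternatives from the same `.dt` accessor; the dates and the `parts` frame are invented for illustration:

```python
import pandas as pd

# Illustrative dates only
dates = pd.Series(pd.to_datetime(['2021-07-15', '2021-08-01', '2020-12-31']))

parts = pd.DataFrame({
    'year': dates.dt.year,               # integer year
    'month': dates.dt.month,             # integer month, 1-12
    'weekday': dates.dt.day_name(),      # weekday name as text
    'period': dates.dt.to_period('M'),   # monthly Period, sorts chronologically
})

print(parts.sort_values('period'))
```

A monthly `Period` keeps the month and year together while still sorting in calendar order, which makes it a convenient grouping key.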

---

### Transpose DataFrame

```python
df.T  # Transpose (rows become columns)
```

#### Set custom index before transpose

```python
df.set_index('Month/Year', inplace=True)
df.T
```

---

### Explode Column

- Turns list-type values in a column into separate rows

```python
df.explode('Column_A')
```

- Other columns are duplicated to match expanded rows

---

### Summary

- Use `.drop()`, `.sort_values()`, `.set_index()` for basic data manipulation
- Convert date strings with `pd.to_datetime()` and extract components using `strftime`
- Transpose using `.T`, useful after setting a meaningful index
- `.explode()` is useful to flatten list-type columns into multiple rows

In [454]:
import pandas as pd

In [455]:
tfl_df = pd.read_csv('/content/tfl-daily-cycle-hires.csv')

In [456]:
tfl_df.dtypes

Unnamed: 0,0
Day,object
Number of Bicycle Hires,float64
Unnamed: 2,float64


In [457]:
tfl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4081 entries, 0 to 4080
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Day                      4081 non-null   object 
 1   Number of Bicycle Hires  4081 non-null   float64
 2   Unnamed: 2               0 non-null      float64
dtypes: float64(2), object(1)
memory usage: 95.8+ KB


In [458]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires,Unnamed: 2
0,30/07/2010,6897.0,
1,31/07/2010,5564.0,
2,01/08/2010,4303.0,
3,02/08/2010,6642.0,
4,03/08/2010,7966.0,
...,...,...,...
4076,26/09/2021,45120.0,
4077,27/09/2021,32167.0,
4078,28/09/2021,32539.0,
4079,29/09/2021,39889.0,


In [459]:
tfl_df.drop(columns='Unnamed: 2',inplace=True)

In [460]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires
0,30/07/2010,6897.0
1,31/07/2010,5564.0
2,01/08/2010,4303.0
3,02/08/2010,6642.0
4,03/08/2010,7966.0
...,...,...
4076,26/09/2021,45120.0
4077,27/09/2021,32167.0
4078,28/09/2021,32539.0
4079,29/09/2021,39889.0


In [461]:
tfl_df.dtypes

Unnamed: 0,0
Day,object
Number of Bicycle Hires,float64


In [462]:
tfl_df['Day'] = tfl_df['Day'].astype('datetime64[ns]')

In [463]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires
0,2010-07-30,6897.0
1,2010-07-31,5564.0
2,2010-01-08,4303.0
3,2010-02-08,6642.0
4,2010-03-08,7966.0
...,...,...
4076,2021-09-26,45120.0
4077,2021-09-27,32167.0
4078,2021-09-28,32539.0
4079,2021-09-29,39889.0


In [464]:
tfl_df.dtypes

Unnamed: 0,0
Day,datetime64[ns]
Number of Bicycle Hires,float64


In [465]:
pd.to_datetime(tfl_df['Day'],format='%d/%m/%Y')

Unnamed: 0,Day
0,2010-07-30
1,2010-07-31
2,2010-01-08
3,2010-02-08
4,2010-03-08
...,...
4076,2021-09-26
4077,2021-09-27
4078,2021-09-28
4079,2021-09-29
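
In the output above, the source string `01/08/2010` is displayed as `2010-01-08`, which suggests the earlier `astype` call read some day-first strings month-first. Because `Day` was already a datetime column when `pd.to_datetime(..., format=...)` ran, and the result was not assigned back, the format string could not correct those rows. A minimal sketch of parsing the raw strings directly, using a few values copied from the table above; the names `raw`, `day_first`, and `month_first` are illustrative:

```python
import pandas as pd

# Raw strings as they appear in the CSV (consecutive days)
raw = pd.Series(['30/07/2010', '31/07/2010', '01/08/2010', '02/08/2010'])

# Day-first parsing with an explicit format
day_first = pd.to_datetime(raw, format='%d/%m/%Y')

# Month-first reading of the two ambiguous strings, shown only for contrast
month_first = pd.to_datetime(raw.iloc[2:], format='%m/%d/%Y')

print(day_first.tolist())
print(month_first.tolist())

# Consecutive daily records should parse to increasing dates
print(day_first.is_monotonic_increasing)  # True
```

Applying the explicit format to the original text column (before any `astype` conversion) and assigning the result back is what keeps the daily sequence in order.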


In [466]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires
0,2010-07-30,6897.0
1,2010-07-31,5564.0
2,2010-01-08,4303.0
3,2010-02-08,6642.0
4,2010-03-08,7966.0
...,...,...
4076,2021-09-26,45120.0
4077,2021-09-27,32167.0
4078,2021-09-28,32539.0
4079,2021-09-29,39889.0


In [467]:
tfl_df.sort_values(by='Number of Bicycle Hires',ascending=False,inplace=True)

In [468]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires
1805,2015-09-07,73094.0
3592,2020-05-30,70170.0
3587,2020-05-25,67034.0
3606,2020-06-13,65045.0
3613,2020-06-20,64041.0
...,...,...
151,2010-12-28,3763.0
905,2013-01-20,3728.0
555,2012-05-02,3531.0
141,2010-12-18,2805.0


In [469]:
tfl_df.head()

Unnamed: 0,Day,Number of Bicycle Hires
1805,2015-09-07,73094.0
3592,2020-05-30,70170.0
3587,2020-05-25,67034.0
3606,2020-06-13,65045.0
3613,2020-06-20,64041.0


In [470]:
tfl_df.tail()

Unnamed: 0,Day,Number of Bicycle Hires
151,2010-12-28,3763.0
905,2013-01-20,3728.0
555,2012-05-02,3531.0
141,2010-12-18,2805.0
142,2010-12-19,2764.0


In [471]:
import datetime as dt  # optional: Series.dt below is a pandas accessor, not this module

In [472]:
tfl_df['Day'].dt.strftime('%m-%y')

Unnamed: 0,Day
1805,09-15
3592,05-20
3587,05-20
3606,06-20
3613,06-20
...,...
151,12-10
905,01-13
555,05-12
141,12-10


In [473]:
tfl_df['Month / Year'] = tfl_df['Day'].dt.strftime('%b-%y')

In [474]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires,Month / Year
1805,2015-09-07,73094.0,Sep-15
3592,2020-05-30,70170.0,May-20
3587,2020-05-25,67034.0,May-20
3606,2020-06-13,65045.0,Jun-20
3613,2020-06-20,64041.0,Jun-20
...,...,...,...
151,2010-12-28,3763.0,Dec-10
905,2013-01-20,3728.0,Jan-13
555,2012-05-02,3531.0,May-12
141,2010-12-18,2805.0,Dec-10


In [475]:
tfl_df.T

Unnamed: 0,1805,3592,3587,3606,3613,1833,3593,3607,3963,3641,...,155,149,1251,2,150,151,905,555,141,142
Day,2015-09-07 00:00:00,2020-05-30 00:00:00,2020-05-25 00:00:00,2020-06-13 00:00:00,2020-06-20 00:00:00,2015-06-08 00:00:00,2020-05-31 00:00:00,2020-06-14 00:00:00,2021-05-06 00:00:00,2020-07-18 00:00:00,...,2011-01-01 00:00:00,2010-12-26 00:00:00,2014-01-01 00:00:00,2010-01-08 00:00:00,2010-12-27 00:00:00,2010-12-28 00:00:00,2013-01-20 00:00:00,2012-05-02 00:00:00,2010-12-18 00:00:00,2010-12-19 00:00:00
Number of Bicycle Hires,73094.0,70170.0,67034.0,65045.0,64041.0,63963.0,63116.0,57516.0,56900.0,56654.0,...,4555.0,4383.0,4327.0,4303.0,3971.0,3763.0,3728.0,3531.0,2805.0,2764.0
Month / Year,Sep-15,May-20,May-20,Jun-20,Jun-20,Jun-15,May-20,Jun-20,May-21,Jul-20,...,Jan-11,Dec-10,Jan-14,Jan-10,Dec-10,Dec-10,Jan-13,May-12,Dec-10,Dec-10


In [476]:
tfl_df.transpose()

Unnamed: 0,1805,3592,3587,3606,3613,1833,3593,3607,3963,3641,...,155,149,1251,2,150,151,905,555,141,142
Day,2015-09-07 00:00:00,2020-05-30 00:00:00,2020-05-25 00:00:00,2020-06-13 00:00:00,2020-06-20 00:00:00,2015-06-08 00:00:00,2020-05-31 00:00:00,2020-06-14 00:00:00,2021-05-06 00:00:00,2020-07-18 00:00:00,...,2011-01-01 00:00:00,2010-12-26 00:00:00,2014-01-01 00:00:00,2010-01-08 00:00:00,2010-12-27 00:00:00,2010-12-28 00:00:00,2013-01-20 00:00:00,2012-05-02 00:00:00,2010-12-18 00:00:00,2010-12-19 00:00:00
Number of Bicycle Hires,73094.0,70170.0,67034.0,65045.0,64041.0,63963.0,63116.0,57516.0,56900.0,56654.0,...,4555.0,4383.0,4327.0,4303.0,3971.0,3763.0,3728.0,3531.0,2805.0,2764.0
Month / Year,Sep-15,May-20,May-20,Jun-20,Jun-20,Jun-15,May-20,Jun-20,May-21,Jul-20,...,Jan-11,Dec-10,Jan-14,Jan-10,Dec-10,Dec-10,Jan-13,May-12,Dec-10,Dec-10


In [477]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires,Month / Year
1805,2015-09-07,73094.0,Sep-15
3592,2020-05-30,70170.0,May-20
3587,2020-05-25,67034.0,May-20
3606,2020-06-13,65045.0,Jun-20
3613,2020-06-20,64041.0,Jun-20
...,...,...,...
151,2010-12-28,3763.0,Dec-10
905,2013-01-20,3728.0,Jan-13
555,2012-05-02,3531.0,May-12
141,2010-12-18,2805.0,Dec-10


In [478]:
tfl_df.set_index(keys='Month / Year', inplace=True)

In [479]:
tfl_df.transpose()

Month / Year,Sep-15,May-20,May-20.1,Jun-20,Jun-20.1,Jun-15,May-20.2,Jun-20.2,May-21,Jul-20,...,Jan-11,Dec-10,Jan-14,Jan-10,Dec-10.1,Dec-10.2,Jan-13,May-12,Dec-10.3,Dec-10.4
Day,2015-09-07 00:00:00,2020-05-30 00:00:00,2020-05-25 00:00:00,2020-06-13 00:00:00,2020-06-20 00:00:00,2015-06-08 00:00:00,2020-05-31 00:00:00,2020-06-14 00:00:00,2021-05-06 00:00:00,2020-07-18 00:00:00,...,2011-01-01 00:00:00,2010-12-26 00:00:00,2014-01-01 00:00:00,2010-01-08 00:00:00,2010-12-27 00:00:00,2010-12-28 00:00:00,2013-01-20 00:00:00,2012-05-02 00:00:00,2010-12-18 00:00:00,2010-12-19 00:00:00
Number of Bicycle Hires,73094.0,70170.0,67034.0,65045.0,64041.0,63963.0,63116.0,57516.0,56900.0,56654.0,...,4555.0,4383.0,4327.0,4303.0,3971.0,3763.0,3728.0,3531.0,2805.0,2764.0


In [480]:
column_index = ['A','B']
data_list = [[[2,5], 'nested list'], [2, 'not a nested list'], [[3,4],'nested list 2']]

In [481]:
df=pd.DataFrame(data=data_list,columns=column_index)

In [482]:
df

Unnamed: 0,A,B
0,"[2, 5]",nested list
1,2,not a nested list
2,"[3, 4]",nested list 2


In [483]:
df.explode(column='A')

Unnamed: 0,A,B
0,2,nested list
0,5,nested list
1,2,not a nested list
2,3,nested list 2
2,4,nested list 2
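
As the output shows, exploded rows keep the index label of the row they came from, so labels repeat. A short sketch, rebuilding the same toy frame under the illustrative name `df_lists`, of asking for a fresh index and a numeric dtype:

```python
import pandas as pd

# Same toy data as above, under a different (illustrative) name
df_lists = pd.DataFrame({
    'A': [[2, 5], 2, [3, 4]],
    'B': ['nested list', 'not a nested list', 'nested list 2'],
})

# ignore_index=True renumbers the rows 0..n-1 after exploding
exploded = df_lists.explode('A', ignore_index=True)

# The exploded column is object dtype; cast it once the values are scalars
exploded['A'] = exploded['A'].astype(int)

print(exploded)
```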


In [487]:
tfl_df

Unnamed: 0_level_0,Day,Number of Bicycle Hires
Month / Year,Unnamed: 1_level_1,Unnamed: 2_level_1
Sep-15,2015-09-07,73094.0
May-20,2020-05-30,70170.0
May-20,2020-05-25,67034.0
Jun-20,2020-06-13,65045.0
Jun-20,2020-06-20,64041.0
...,...,...
Dec-10,2010-12-28,3763.0
Jan-13,2013-01-20,3728.0
May-12,2012-05-02,3531.0
Dec-10,2010-12-18,2805.0


In [489]:
tfl_df.reset_index(inplace=True)

In [490]:
tfl_df.pivot(index='Day',columns='Month / Year',values='Number of Bicycle Hires')

Month / Year,Apr-10,Apr-11,Apr-12,Apr-13,Apr-14,Apr-15,Apr-16,Apr-17,Apr-18,Apr-19,...,Sep-12,Sep-13,Sep-14,Sep-15,Sep-16,Sep-17,Sep-18,Sep-19,Sep-20,Sep-21
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2010-01-08,,,,,,,,,,,...,,,,,,,,,,
2010-01-09,,,,,,,,,,,...,,,,,,,,,,
2010-01-10,,,,,,,,,,,...,,,,,,,,,,
2010-01-11,,,,,,,,,,,...,,,,,,,,,,
2010-01-12,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-05,,,,,,,,,,,...,,,,,,,,,,
2021-12-06,,,,,,,,,,,...,,,,,,,,,,
2021-12-07,,,,,,,,,,,...,,,,,,,,,,
2021-12-08,,,,,,,,,,,...,,,,,,,,,,
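
With one row per date and one column per month label, each row of the pivot above can hold at most one value, so the table is almost entirely `NaN`. A hedged sketch of a denser layout (years as rows, months as columns, totals in the cells) using `pivot_table`; the hire counts and the `Year`/`Month` helper columns are made up for illustration:

```python
import pandas as pd

# Made-up daily hire counts, for illustration only
hires = pd.DataFrame({
    'Day': pd.to_datetime(['2020-05-30', '2020-05-31', '2020-06-13', '2021-05-06']),
    'Number of Bicycle Hires': [100.0, 200.0, 300.0, 400.0],
})

# Helper columns derived from the date
hires['Year'] = hires['Day'].dt.year
hires['Month'] = hires['Day'].dt.month

# One row per year, one column per month, summing the daily values
monthly = hires.pivot_table(
    index='Year',
    columns='Month',
    values='Number of Bicycle Hires',
    aggfunc='sum',
)

print(monthly)
```

Cells stay `NaN` only where a year has no data for that month.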


## Pandas Notes – Aggregation, GroupBy & Pivot Tables

---

### Setup

```python
import pandas as pd

df = pd.read_csv('top_10_countries.csv')
df.drop(['% of world', 'Date'], axis=1, inplace=True)
df.rename(columns={'Country / Dependency': 'Country'}, inplace=True)
df['Country'] = df['Country'].astype(str)
df['Region'] = df['Region'].astype(str)
# One entry per row, in rank order (China, India, United States, ...)
df['Subregion'] = ['Eastern Asia', 'Southern Asia', 'Northern America', 'Southeast Asia',
                   'Southern Asia', 'Southern America', 'Western Africa', 'Southern Asia',
                   'Eastern Europe', 'Central America']
df['Subregion'] = df['Subregion'].astype(str)

# Reorder columns
df = df.reindex(columns=['Country', 'Region', 'Subregion', 'Population', 'Rank'])
```

---

### Basic Aggregation Functions

```python
df['Population'].sum()       # Total population
df['Population'].mean()      # Average
df['Population'].max()       # Max
df['Population'].min()       # Min
df['Population'].std()       # Standard deviation
df['Population'].count()     # Count
```

---

### `.describe()` Method

```python
df.describe()  # Summary statistics for all numeric columns
```

---

### Unique & Count of Unique Values

```python
df['Region'].unique()    # Unique regions
df['Region'].nunique()   # Count of unique regions
```

---

### GroupBy – Basic Usage

```python
df.groupby('Region')['Population'].sum()
df.groupby('Region')['Population'].mean()
df.groupby('Region')['Population'].max()
df.groupby('Region')['Population'].min()
```

---

### GroupBy – Multiple Columns

```python
df[['Region', 'Subregion', 'Population']].groupby(['Region', 'Subregion']).sum()
```

---

### GroupBy – Multiple Aggregation Functions

```python
df[['Region', 'Subregion', 'Population']].groupby(['Region', 'Subregion'])['Population'].agg(['sum', 'max'])
```
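
When several aggregations are applied, named aggregation gives each result column an explicit name. A minimal sketch on a four-row frame built from population figures that appear in these notes; the output names `total` and `largest` are illustrative:

```python
import pandas as pd

pop = pd.DataFrame({
    'Region': ['Asia', 'Asia', 'Americas', 'Americas'],
    'Population': [1412600000, 1386946912, 333073186, 214231641],
})

# Each keyword is an output column: (source column, aggregation)
summary = pop.groupby('Region').agg(
    total=('Population', 'sum'),
    largest=('Population', 'max'),
)

print(summary)
```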

---

### Pivot Table (Function Syntax)

```python
pd.pivot_table(
    data=df,
    index='Country',
    columns='Subregion',
    values='Population',
    aggfunc='sum',
    fill_value=0
)
```

---

### Pivot Table (Method Syntax)

```python
df.pivot_table(
    index=['Region', 'Subregion'],
    values='Population',
    aggfunc='sum'
)
```
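
Pivot tables can also append a grand-total row, much like a spreadsheet pivot. A hedged sketch with small made-up numbers, using the `margins` and `margins_name` parameters of `pivot_table`:

```python
import pandas as pd

# Small made-up values so the totals are easy to check by eye
pop = pd.DataFrame({
    'Region': ['Asia', 'Asia', 'Americas'],
    'Population': [10, 20, 5],
})

# margins=True adds a summary row labelled by margins_name
table = pop.pivot_table(
    index='Region',
    values='Population',
    aggfunc='sum',
    margins=True,
    margins_name='Total',
)

print(table)
```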

---

### Summary

- Use `.groupby()` to split data into groups, apply aggregation, and combine results.
- Use `.agg()` to apply multiple aggregations.
- Use `.pivot_table()` for Excel-style pivoting.
- `fill_value` in pivot tables helps handle missing data (e.g., replace `NaN` with `0`).

In [491]:
import pandas as pd

In [492]:
countries_df = pd.read_csv('top_10_countries.csv')

In [493]:
countries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Rank                  10 non-null     int64 
 1   Country / Dependency  10 non-null     object
 2   Region                10 non-null     object
 3   Population            10 non-null     int64 
 4   % of world            10 non-null     object
 5   Date                  10 non-null     object
dtypes: int64(2), object(4)
memory usage: 612.0+ bytes


In [494]:
countries_df.drop(columns=['% of world','Date'], inplace=True)

In [495]:
countries_df.rename(columns={'Country / Dependency' : 'Country'}, inplace=True)

In [496]:
countries_df[['Country','Region']] = countries_df[['Country','Region']].astype('string')

In [497]:
sub_regions = ['Eastern Asia', 'Southern Asia', 'Northern America',
               'Southeast Asia', 'Southern Asia', 'Southern America',
               'Western Africa', 'Southern Asia', 'Eastern Europe', 'Central America']

In [498]:
countries_df['Sub Region'] = sub_regions

In [499]:
countries_df['Sub Region'] = countries_df['Sub Region'].astype('string')

In [500]:
countries_df.dtypes

Unnamed: 0,0
Rank,int64
Country,string[python]
Region,string[python]
Population,int64
Sub Region,string[python]


In [501]:
countries_df=countries_df.reindex(columns=['Rank','Country','Region','Sub Region','Population'])

In [502]:
countries_df

Unnamed: 0,Rank,Country,Region,Sub Region,Population
0,1,China,Asia,Eastern Asia,1412600000
1,2,India,Asia,Southern Asia,1386946912
2,3,United States,Americas,Northern America,333073186
3,4,Indonesia,Asia,Southeast Asia,271350000
4,5,Pakistan,Asia,Southern Asia,225200000
5,6,Brazil,Americas,Southern America,214231641
6,7,Nigeria,Africa,Western Africa,211401000
7,8,Bangladesh,Asia,Southern Asia,172062576
8,9,Russia,Europe,Eastern Europe,146171015
9,10,Mexico,Americas,Central America,126014024


In [508]:
int(countries_df['Population'].sum())

4499050354

In [510]:
countries_df['Population'].mean()

np.float64(449905035.4)

In [511]:
countries_df['Population'].max()

1412600000

In [512]:
countries_df['Population'].min()

126014024

In [513]:
countries_df['Population'].count()

np.int64(10)

In [515]:
countries_df['Population'].std()

504164358.8258898

In [517]:
countries_df['Population'].describe()

Unnamed: 0,Population
count,10.0
mean,449905000.0
std,504164400.0
min,126014000.0
25%,181897200.0
50%,219715800.0
75%,317642400.0
max,1412600000.0


In [518]:
countries_df.describe()

Unnamed: 0,Rank,Population
count,10.0,10.0
mean,5.5,449905000.0
std,3.02765,504164400.0
min,1.0,126014000.0
25%,3.25,181897200.0
50%,5.5,219715800.0
75%,7.75,317642400.0
max,10.0,1412600000.0


In [519]:
countries_df['Sub Region'].max()

'Western Africa'

In [520]:
countries_df['Region'].min()

'Africa'

In [521]:
countries_df['Region'].unique()

<StringArray>
['Asia', 'Americas', 'Africa', 'Europe']
Length: 4, dtype: string

In [522]:
countries_df['Region'].nunique()

4

In [523]:
countries_df

Unnamed: 0,Rank,Country,Region,Sub Region,Population
0,1,China,Asia,Eastern Asia,1412600000
1,2,India,Asia,Southern Asia,1386946912
2,3,United States,Americas,Northern America,333073186
3,4,Indonesia,Asia,Southeast Asia,271350000
4,5,Pakistan,Asia,Southern Asia,225200000
5,6,Brazil,Americas,Southern America,214231641
6,7,Nigeria,Africa,Western Africa,211401000
7,8,Bangladesh,Asia,Southern Asia,172062576
8,9,Russia,Europe,Eastern Europe,146171015
9,10,Mexico,Americas,Central America,126014024


In [524]:
countries_df.groupby(by='Region').sum()

Unnamed: 0_level_0,Rank,Country,Sub Region,Population
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Africa,7,Nigeria,Western Africa,211401000
Americas,19,United States Brazil Mexico,Northern AmericaSouthern AmericaCentral America,673318851
Asia,20,China India Indonesia Pakistan Bangladesh,Eastern AsiaSouthern AsiaSoutheast AsiaSouther...,3468159488
Europe,9,Russia,Eastern Europe,146171015
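
Summing every column concatenates the text columns, as the `Country` and `Sub Region` cells above show. A minimal sketch of restricting the aggregation to numeric columns with `numeric_only=True`; the three-row `frame` reuses values from the table for illustration:

```python
import pandas as pd

frame = pd.DataFrame({
    'Region': ['Asia', 'Asia', 'Europe'],
    'Country': ['China', 'India', 'Russia'],
    'Population': [1412600000, 1386946912, 146171015],
})

# Text columns are left out of the result instead of being concatenated
totals = frame.groupby('Region').sum(numeric_only=True)

print(totals)
```

Selecting the needed columns first, as the next cell does, achieves the same result.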


In [527]:
countries_df[['Region','Population']].groupby(by='Region').sum()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,211401000
Americas,673318851
Asia,3468159488
Europe,146171015


In [529]:
countries_df[['Region','Population']].groupby(by='Region').mean()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,211401000.0
Americas,224439617.0
Asia,693631897.6
Europe,146171015.0


In [530]:
countries_df[['Region','Population']].groupby(by='Region').max()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,211401000
Americas,333073186
Asia,1412600000
Europe,146171015


In [531]:
countries_df[['Region','Population']].groupby(by='Region').min()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,211401000
Americas,126014024
Asia,172062576
Europe,146171015


In [533]:
countries_df[['Region','Population']].groupby(by='Region').std()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,
Americas,103906300.0
Asia,645636500.0
Europe,


In [534]:
countries_df[['Region','Population']].groupby(by='Region').first()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,211401000
Americas,333073186
Asia,1412600000
Europe,146171015


In [535]:
countries_df[['Region','Population']].groupby(by='Region').last()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,211401000
Americas,126014024
Asia,172062576
Europe,146171015


In [536]:
countries_df[['Region','Population']].groupby(by='Region').size()

Unnamed: 0_level_0,0
Region,Unnamed: 1_level_1
Africa,1
Americas,3
Asia,5
Europe,1


In [537]:
countries_df[['Region','Population']].groupby(by='Region').count()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,1
Americas,3
Asia,5
Europe,1
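
`size()` and `count()` agree here because `Population` has no missing values. A short sketch, with an invented frame containing one `NaN`, of how they differ: `size()` counts rows per group, while `count()` counts non-null values.

```python
import numpy as np
import pandas as pd

# Invented data with one missing population value
frame = pd.DataFrame({
    'Region': ['Asia', 'Asia', 'Europe'],
    'Population': [1.0, np.nan, 3.0],
})

grouped = frame.groupby('Region')['Population']
sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-null values per group

print(sizes)
print(counts)
```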


In [538]:
countries_df[['Region','Population']].groupby(by='Region').var()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,
Americas,1.079653e+16
Asia,4.168465e+17
Europe,


In [539]:
countries_df[['Region','Population']].groupby(by='Region').sem()

Unnamed: 0_level_0,Population
Region,Unnamed: 1_level_1
Africa,
Americas,59990350.0
Asia,288737400.0
Europe,


In [540]:
countries_df[['Region','Population']].groupby(by='Region').describe()

Unnamed: 0_level_0,Population,Population,Population,Population,Population,Population,Population,Population
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Region,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Africa,1.0,211401000.0,,211401000.0,211401000.0,211401000.0,211401000.0,211401000.0
Americas,3.0,224439617.0,103906300.0,126014024.0,170122832.5,214231641.0,273652400.0,333073200.0
Asia,5.0,693631897.6,645636500.0,172062576.0,225200000.0,271350000.0,1386947000.0,1412600000.0
Europe,1.0,146171015.0,,146171015.0,146171015.0,146171015.0,146171000.0,146171000.0


In [544]:
countries_df[['Region','Population']].groupby(by='Region').nth(2)

Unnamed: 0,Region,Population
3,Asia,271350000
9,Americas,126014024


In [546]:
countries_df[['Region','Sub Region','Population']].groupby(by=['Region','Sub Region']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,Population
Region,Sub Region,Unnamed: 2_level_1
Africa,Western Africa,211401000
Americas,Central America,126014024
Americas,Northern America,333073186
Americas,Southern America,214231641
Asia,Eastern Asia,1412600000
Asia,Southeast Asia,271350000
Asia,Southern Asia,1784209488
Europe,Eastern Europe,146171015


In [547]:
countries_df[['Region','Sub Region','Population']].groupby(by=['Region','Sub Region']).agg(['sum','max'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,Population
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,max
Region,Sub Region,Unnamed: 2_level_2,Unnamed: 3_level_2
Africa,Western Africa,211401000,211401000
Americas,Central America,126014024,126014024
Americas,Northern America,333073186,333073186
Americas,Southern America,214231641,214231641
Asia,Eastern Asia,1412600000,1412600000
Asia,Southeast Asia,271350000,271350000
Asia,Southern Asia,1784209488,1386946912
Europe,Eastern Europe,146171015,146171015


In [548]:
countries_df

Unnamed: 0,Rank,Country,Region,Sub Region,Population
0,1,China,Asia,Eastern Asia,1412600000
1,2,India,Asia,Southern Asia,1386946912
2,3,United States,Americas,Northern America,333073186
3,4,Indonesia,Asia,Southeast Asia,271350000
4,5,Pakistan,Asia,Southern Asia,225200000
5,6,Brazil,Americas,Southern America,214231641
6,7,Nigeria,Africa,Western Africa,211401000
7,8,Bangladesh,Asia,Southern Asia,172062576
8,9,Russia,Europe,Eastern Europe,146171015
9,10,Mexico,Americas,Central America,126014024


In [549]:
pd.pivot_table(data=countries_df,index='Region',values='Population',columns='Sub Region',aggfunc='sum')

Sub Region,Central America,Eastern Asia,Eastern Europe,Northern America,Southeast Asia,Southern America,Southern Asia,Western Africa
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Africa,,,,,,,,211401000.0
Americas,126014024.0,,,333073186.0,,214231641.0,,
Asia,,1412600000.0,,,271350000.0,,1784209000.0,
Europe,,,146171015.0,,,,,


In [550]:
pd.pivot_table(data=countries_df,index='Region',values='Population',columns='Sub Region',aggfunc='sum',fill_value=0)

Sub Region,Central America,Eastern Asia,Eastern Europe,Northern America,Southeast Asia,Southern America,Southern Asia,Western Africa
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Africa,0,0,0,0,0,0,0,211401000
Americas,126014024,0,0,333073186,0,214231641,0,0
Asia,0,1412600000,0,0,271350000,0,1784209488,0
Europe,0,0,146171015,0,0,0,0,0


In [None]:
countries_df.pivot_table(index='Region')

In [556]:
countries_df.pivot(index=['Region','Sub Region'], values='Population',columns='Country')

Unnamed: 0_level_0,Country,Bangladesh,Brazil,China,India,Indonesia,Mexico,Nigeria,Pakistan,Russia,United States
Region,Sub Region,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Africa,Western Africa,,,,,,,211401000.0,,,
Americas,Central America,,,,,,126014024.0,,,,
Americas,Northern America,,,,,,,,,,333073186.0
Americas,Southern America,,214231641.0,,,,,,,,
Asia,Eastern Asia,,,1412600000.0,,,,,,,
Asia,Southeast Asia,,,,,271350000.0,,,,,
Asia,Southern Asia,172062576.0,,,1386947000.0,,,,225200000.0,,
Europe,Eastern Europe,,,,,,,,,146171015.0,
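
`pivot()` only reshapes, so it works above because every index/column combination occurs once. A hedged sketch of what happens when a combination repeats, using three rows taken from the table; `pivot()` raises `ValueError`, while `pivot_table()` aggregates the duplicates:

```python
import pandas as pd

frame = pd.DataFrame({
    'Region': ['Asia', 'Asia', 'Europe'],
    'Sub Region': ['Southern Asia', 'Southern Asia', 'Eastern Europe'],
    'Population': [1386946912, 225200000, 146171015],
})

# (Asia, Southern Asia) appears twice, so a pure reshape is ambiguous
try:
    frame.pivot(index='Region', columns='Sub Region', values='Population')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print('pivot failed:', exc)

# pivot_table resolves duplicates with the aggregation function
table = frame.pivot_table(index='Region', columns='Sub Region',
                          values='Population', aggfunc='sum')
print(table)
```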


## Pandas Notes – Data Cleansing

---

### Why Data Cleansing is Important

Before performing analysis or visualization, it's essential to clean your data:

- Remove unwanted characters (e.g., %, $)
- Convert data types (e.g., string to datetime)
- Handle missing values
- Remove inconsistent formatting (e.g., whitespace)

---

### Dataset Setup

```python
import pandas as pd

# Simulate a messy employee dataset
data = {
    'Employee': [' Alice ', 'Bob', ' Charlie ', 'David', 'Eve', None, 'Grace', 'Hank'],
    'Position': ['Manager', 'Analyst', 'Manager', 'Analyst', None, 'Manager', 'Director', 'Analyst'],
    'Salary': [70000, 65000, 72000, None, 68000, 75000, None, None]
}

df = pd.DataFrame(data)
```

---

### Initial Exploration

```python
df.head()       # First 5 rows
df.tail(2)      # Last 2 rows
```

---

### Trim Whitespace in Strings

```python
df['Employee'] = df['Employee'].apply(lambda x: x.strip() if isinstance(x, str) else x)
```
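
The same trimming can be written with the vectorized `.str` accessor, which skips missing values without an explicit `isinstance` check. A minimal sketch on an illustrative Series:

```python
import pandas as pd

# Illustrative names with stray whitespace and one missing value
names = pd.Series([' Alice ', 'Bob', None, ' Charlie '])

# .str.strip() trims each string and leaves the missing entry missing
cleaned = names.str.strip()

print(cleaned.tolist())
```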

---

### Detect Missing Values

```python
df.isnull()         # Boolean matrix of nulls
df.isnull().sum()   # Count of nulls per column
```

---

### Drop Missing Values

- Drop rows with any nulls:

```python
df.dropna(inplace=True)
```

- Drop rows with fewer than 2 non-null values:

```python
df = df.dropna(thresh=2)
```

- Drop columns with nulls:

```python
df.dropna(axis=1, inplace=True)
```

- Drop columns unless they have at least 7 non-null values:

```python
df.dropna(axis=1, thresh=7, inplace=True)
```
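
Often only some columns are required for a row to be usable. A hedged sketch using the `subset` parameter of `dropna`, on a small invented `staff` frame, which drops rows only when `Salary` is missing:

```python
import pandas as pd

# Invented data: Bob lacks a position, Eve lacks a salary
staff = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Eve'],
    'Position': ['Manager', None, 'Analyst'],
    'Salary': [70000, 65000, None],
})

# Only nulls in the listed columns cause a row to be dropped
with_salary = staff.dropna(subset=['Salary'])

print(with_salary)
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves `staff` unchanged.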

---

### Fill Missing Values

- Fill with a constant:

```python
df.fillna('No Data', inplace=True)
```

- Fill with zero:

```python
df.fillna(0, inplace=True)
```

- Forward fill (use previous row's value):

```python
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas
```

- Backward fill (use next row's value):

```python
df.bfill(inplace=True)  # fillna(method='bfill') is deprecated in recent pandas
```

- Fill salary with column mean:

```python
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
```
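
A single constant rarely suits every column: a text placeholder in `Salary` turns the column into mixed types. A minimal sketch of passing a dict to `fillna` so each column gets its own fill value; the `staff` frame and the choice of median are illustrative assumptions:

```python
import pandas as pd

# Invented data with one missing position and one missing salary
staff = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Eve'],
    'Position': ['Manager', None, 'Analyst'],
    'Salary': [70000.0, None, 68000.0],
})

# Keys are column names, values are the fill value for that column
filled = staff.fillna({
    'Position': 'Unknown',
    'Salary': staff['Salary'].median(),
})

print(filled)
```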

---

### Summary Table

| Task                      | Method/Function                     |
|---------------------------|-------------------------------------|
| Trim whitespace           | `apply(lambda x: x.strip())`        |
| Detect nulls              | `isnull()` / `notnull()`            |
| Drop missing values       | `dropna()`                          |
| Fill missing values       | `fillna()`                          |
| Forward/backward fill     | `ffill()` / `bfill()`               |
| Fill with mean            | `fillna(df['col'].mean())`          |

---

### Good Practice

- Always inspect your data first using `.head()`, `.dtypes`, and `.isnull()`
- Clean string fields before aggregation
- Convert columns to the correct data type before performing operations

```python
df['Position'] = df['Position'].str.strip().str.title()  # .str methods keep missing values as NaN
```

In [617]:
import pandas as pd
import numpy as np

In [618]:
employees = ['  Employee 1', 'Employee 2    ', 'Employee 3   ', '   Employee 4', '   Employee 5', '  Employee 6', 'Employee 7', 'Employee 8']
positions = ['Manager', 'Developer', 'Analyst', 'Engineer', 'Designer', None, 'Senior Manager', None]
salary = [100000, 90000, 80000, 70000, 60000, 50000, None, None]
columns = ['Employee', 'Position', 'Salary']

In [619]:
df = pd.DataFrame(data=list(zip(employees,positions,salary)),columns=columns)

In [620]:
df

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,,50000.0
6,Employee 7,Senior Manager,
7,Employee 8,,


In [621]:
df.head()

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0


In [622]:
df.tail(6)

Unnamed: 0,Employee,Position,Salary
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,,50000.0
6,Employee 7,Senior Manager,
7,Employee 8,,


In [623]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Employee  8 non-null      object 
 1   Position  6 non-null      object 
 2   Salary    6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 324.0+ bytes


In [624]:
df.describe()

Unnamed: 0,Salary
count,6.0
mean,75000.0
std,18708.286934
min,50000.0
25%,62500.0
50%,75000.0
75%,87500.0
max,100000.0


In [625]:
df['Employee']

Unnamed: 0,Employee
0,Employee 1
1,Employee 2
2,Employee 3
3,Employee 4
4,Employee 5
5,Employee 6
6,Employee 7
7,Employee 8


In [626]:
df['Employee'] = df['Employee'].apply(lambda x: x.strip())

In [627]:
df['Employee']

Unnamed: 0,Employee
0,Employee 1
1,Employee 2
2,Employee 3
3,Employee 4
4,Employee 5
5,Employee 6
6,Employee 7
7,Employee 8


In [628]:
df.isnull()

Unnamed: 0,Employee,Position,Salary
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,True,False
6,False,False,True
7,False,True,True


In [629]:
df.isna()

Unnamed: 0,Employee,Position,Salary
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,True,False
6,False,False,True
7,False,True,True


In [630]:
#df.dropna()

In [631]:
df.dropna(thresh=2)

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,,50000.0
6,Employee 7,Senior Manager,


In [643]:
df.dropna(axis=1,thresh=7)

Unnamed: 0,Employee
0,Employee 1
1,Employee 2
2,Employee 3
3,Employee 4
4,Employee 5
5,Employee 6
6,Employee 7
7,Employee 8


In [645]:
df.fillna('no data')

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,no data,50000.0
6,Employee 7,Senior Manager,no data
7,Employee 8,no data,no data


In [646]:
df.fillna(0)

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,0,50000.0
6,Employee 7,Senior Manager,0.0
7,Employee 8,0,0.0


In [648]:
df.ffill()

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,Designer,50000.0
6,Employee 7,Senior Manager,50000.0
7,Employee 8,Senior Manager,50000.0


In [658]:
df['Salary']=df['Salary'].fillna(value=df['Salary'].mean())

In [659]:
df

Unnamed: 0,Employee,Position,Salary
0,Employee 1,Manager,100000.0
1,Employee 2,Developer,90000.0
2,Employee 3,Analyst,80000.0
3,Employee 4,Engineer,70000.0
4,Employee 5,Designer,60000.0
5,Employee 6,,50000.0
6,Employee 7,Senior Manager,75000.0
7,Employee 8,,75000.0


## Pandas Notes – Working with Multiple DataFrames

---

### Concatenating DataFrames

**Use `pd.concat()` to stack DataFrames either row-wise or column-wise.**

```python
import pandas as pd
import numpy as np

# Create DataFrames
df1 = pd.DataFrame(np.full((5, 5), 1), index=list('abcde'), columns=list('ABCDE'))
df2 = pd.DataFrame(np.full((5, 5), 2), index=list('fghij'), columns=list('ABCDE'))
df3 = pd.DataFrame(np.full((5, 5), 3), index=list('abcde'), columns=list('VWXYZ'))
```

**Concatenate row-wise (default axis=0):**

```python
pd.concat([df1, df2])
```

**Concatenate column-wise (axis=1):**

```python
pd.concat([df1, df3], axis=1)
```

If the row or column labels of the DataFrames don't match, pandas keeps the union of the labels and fills the gaps with `NaN`.
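The alignment behaviour can be sketched on two small frames. The names `left` and `right` are illustrative and not part of the notes. With the default `join='outer'`, labels present in only one frame are filled with `NaN`, while `join='inner'` keeps only the shared labels.

```python
import pandas as pd

# Two small illustrative frames that share only the row label 'b'
left = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'B': [3, 4]}, index=['b', 'c'])

# Default join='outer': union of row labels, NaN where a frame has no value
outer = pd.concat([left, right], axis=1)

# join='inner': intersection of row labels, so no NaN is introduced
inner = pd.concat([left, right], axis=1, join='inner')

print(outer)
print(inner)
```

Note that introducing `NaN` also converts integer columns to floats, which is why the outputs further down show `1.0` instead of `1`.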

---

### Joining and Merging DataFrames

Joins and merges combine datasets based on a common key (e.g., `dept_id`).

---

#### Relational Data Overview

- Tables are split to reduce redundancy.
- Keys like `dept_id` are used to link related tables.
- Types of joins:
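  - **Inner**: Only rows whose key appears in both tables
  - **Left**: All rows from the left table, plus matching rows from the right (`NaN` where there is no match)
  - **Right**: All rows from the right table, plus matching rows from the left (`NaN` where there is no match)
  - **Outer**: All rows from both tables, with `NaN` for unmatched values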
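As a rough illustration of the four join types, the sketch below counts the rows each one returns. The tables `emp_demo` and `dept_demo` are hypothetical and built only for this example: one employee points at a department that does not exist, and one department has no employees.

```python
import pandas as pd

# Hypothetical tables: employee 3 references a missing department (99),
# and department 30 has no employees
emp_demo = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [10, 20, 99]})
dept_demo = pd.DataFrame({'dept_id': [10, 20, 30],
                          'dept_name': ['HR', 'Finance', 'Sales']})

# Number of rows in the result for each join type
row_counts = {
    how: len(pd.merge(emp_demo, dept_demo, on='dept_id', how=how))
    for how in ('inner', 'left', 'right', 'outer')
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```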

---

### Joining with `join()`

```python
emp = pd.read_excel('employees_hr.xls', sheet_name='employees')
dept = pd.read_excel('employees_hr.xls', sheet_name='departments')

# Join dept to emp using join
emp.join(
    other=dept.set_index('id'),
    on='dept_id',
    how='left'
)
```

**Key Parameters**:
- `other`: DataFrame to join
- `on`: key in the current DataFrame
- `how`: type of join (`left`, `right`, `outer`, `inner`)

`join()` matches the `on` column of the calling DataFrame against the **index** of `other`. If the key in `other` is a regular column, call `set_index()` on it first.
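A minimal sketch of why the index matters, using two hypothetical frames (`emp_demo` and `dept_demo` are made up for this example). Without `set_index()`, the key is compared with the default integer index of the other frame, so nothing matches.

```python
import pandas as pd

emp_demo = pd.DataFrame({'emp_id': [1, 2], 'dept_id': [20, 10]})
dept_demo = pd.DataFrame({'id': [10, 20], 'dept_name': ['HR', 'Finance']})

# Without set_index, dept_id is compared with dept_demo's default index (0, 1),
# so no row matches and dept_name comes back as NaN
unkeyed = emp_demo.join(dept_demo, on='dept_id')

# With set_index('id'), dept_id is compared with the department ids
keyed = emp_demo.join(dept_demo.set_index('id'), on='dept_id')

print(unkeyed)
print(keyed)
```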

---

### Merging with `merge()`

```python
pd.merge(
    left=emp,
    right=dept,
    left_on='dept_id',
    right_on='id',
    how='inner'
)
```

**Advantages of `merge()`**:
- You can use any column as the key
- Full control over both sides of the join

---

### Selecting Specific Columns After Merge

```python
merged = pd.merge(emp, dept, left_on='dept_id', right_on='id', how='left')
merged[['emp_id', 'dept_name']]
```

---

### Notes

- Duplicate keys in either table multiply rows in the result: each matching pair of rows produces one output row
- Inspect the result using `.head()` or `.info()`
- Ensure column data types match before merging
- Use `.str.strip()` to clean strings before joining or merging
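The cleaning and duplicate-key points above can be sketched as follows. The frames `left_demo` and `right_demo` are hypothetical. The `validate` argument of `pd.merge()` is one way (an addition to the notes, not something they use) to have pandas raise an error when the right-hand keys turn out not to be unique.

```python
import pandas as pd

left_demo = pd.DataFrame({'code': [' A1', 'B2 '], 'qty': [5, 7]})
right_demo = pd.DataFrame({'code': ['A1', 'B2'], 'label': ['alpha', 'beta']})

# Stray whitespace prevents the keys from matching, so label is NaN everywhere
raw = pd.merge(left_demo, right_demo, on='code', how='left')

# Strip the key column, then merge again.
# validate='many_to_one' raises MergeError if right_demo has duplicate keys.
left_demo['code'] = left_demo['code'].str.strip()
clean = pd.merge(left_demo, right_demo, on='code', how='left',
                 validate='many_to_one')

print(raw)
print(clean)
```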

---

### Summary Table

| Task                         | Function                  | Description                                      |
|------------------------------|---------------------------|--------------------------------------------------|
| Row-wise concatenation       | `pd.concat(..., axis=0)`  | Stacks data vertically                           |
| Column-wise concatenation    | `pd.concat(..., axis=1)`  | Appends columns side-by-side                     |
| Join by index                | `df.join()`               | Requires right DataFrame's key to be index       |
| Merge by column key          | `pd.merge()`              | Join on specific columns from each DataFrame     |
| Filter specific columns      | `df[['col1', 'col2']]`    | Selects specific columns from a DataFrame        |

---

### Example: Inner Join with `merge`

```python
pd.merge(emp, dept, left_on='dept_id', right_on='id', how='inner')
```

---

In [660]:
import pandas as pd
import numpy as np

In [661]:
data_1 = np.full((5,5),1)
index_1 = (0,1,2,3,4)
columns_1 = (0,1,2,3,4)

In [662]:
df_1 = pd.DataFrame(data_1,index_1,columns_1)

In [663]:
data_2 = np.full((5,5),2)
index_2 = (5,6,7,8,9)
columns_2 = (0,1,2,3,4)

In [664]:
df_2 = pd.DataFrame(data_2,index_2,columns_2)

In [665]:
data_3 = np.full((5,5),3)
index_3 = (0,1,2,3,4)
columns_3 = (5,6,7,8,9)

In [670]:
df_3 = pd.DataFrame(data_3,index_3,columns_3)

In [671]:
df_1

Unnamed: 0,0,1,2,3,4
0,1,1,1,1,1
1,1,1,1,1,1
2,1,1,1,1,1
3,1,1,1,1,1
4,1,1,1,1,1


In [672]:
df_2

Unnamed: 0,0,1,2,3,4
5,2,2,2,2,2
6,2,2,2,2,2
7,2,2,2,2,2
8,2,2,2,2,2
9,2,2,2,2,2


In [673]:
df_3

Unnamed: 0,5,6,7,8,9
0,3,3,3,3,3
1,3,3,3,3,3
2,3,3,3,3,3
3,3,3,3,3,3
4,3,3,3,3,3


In [674]:
pd.concat([df_1,df_2])

Unnamed: 0,0,1,2,3,4
0,1,1,1,1,1
1,1,1,1,1,1
2,1,1,1,1,1
3,1,1,1,1,1
4,1,1,1,1,1
5,2,2,2,2,2
6,2,2,2,2,2
7,2,2,2,2,2
8,2,2,2,2,2
9,2,2,2,2,2


In [675]:
pd.concat([df_1,df_2],axis=1)

Unnamed: 0,0,1,2,3,4,0,1,2,3,4
0,1.0,1.0,1.0,1.0,1.0,,,,,
1,1.0,1.0,1.0,1.0,1.0,,,,,
2,1.0,1.0,1.0,1.0,1.0,,,,,
3,1.0,1.0,1.0,1.0,1.0,,,,,
4,1.0,1.0,1.0,1.0,1.0,,,,,
5,,,,,,2.0,2.0,2.0,2.0,2.0
6,,,,,,2.0,2.0,2.0,2.0,2.0
7,,,,,,2.0,2.0,2.0,2.0,2.0
8,,,,,,2.0,2.0,2.0,2.0,2.0
9,,,,,,2.0,2.0,2.0,2.0,2.0


In [677]:
pd.concat([df_1,df_3])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,1.0,1.0,1.0,1.0,,,,,
1,1.0,1.0,1.0,1.0,1.0,,,,,
2,1.0,1.0,1.0,1.0,1.0,,,,,
3,1.0,1.0,1.0,1.0,1.0,,,,,
4,1.0,1.0,1.0,1.0,1.0,,,,,
0,,,,,,3.0,3.0,3.0,3.0,3.0
1,,,,,,3.0,3.0,3.0,3.0,3.0
2,,,,,,3.0,3.0,3.0,3.0,3.0
3,,,,,,3.0,3.0,3.0,3.0,3.0
4,,,,,,3.0,3.0,3.0,3.0,3.0


In [678]:
pd.concat([df_1,df_3],axis=1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1,1,1,1,1,3,3,3,3,3
1,1,1,1,1,1,3,3,3,3,3
2,1,1,1,1,1,3,3,3,3,3
3,1,1,1,1,1,3,3,3,3,3
4,1,1,1,1,1,3,3,3,3,3


In [679]:
pd.concat([df_1,df_2,df_3])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,1.0,1.0,1.0,1.0,,,,,
1,1.0,1.0,1.0,1.0,1.0,,,,,
2,1.0,1.0,1.0,1.0,1.0,,,,,
3,1.0,1.0,1.0,1.0,1.0,,,,,
4,1.0,1.0,1.0,1.0,1.0,,,,,
5,2.0,2.0,2.0,2.0,2.0,,,,,
6,2.0,2.0,2.0,2.0,2.0,,,,,
7,2.0,2.0,2.0,2.0,2.0,,,,,
8,2.0,2.0,2.0,2.0,2.0,,,,,
9,2.0,2.0,2.0,2.0,2.0,,,,,


In [680]:
pd.concat([df_1,df_2,df_3],axis=1)

Unnamed: 0,0,1,2,3,4,0,1,2,3,4,5,6,7,8,9
0,1.0,1.0,1.0,1.0,1.0,,,,,,3.0,3.0,3.0,3.0,3.0
1,1.0,1.0,1.0,1.0,1.0,,,,,,3.0,3.0,3.0,3.0,3.0
2,1.0,1.0,1.0,1.0,1.0,,,,,,3.0,3.0,3.0,3.0,3.0
3,1.0,1.0,1.0,1.0,1.0,,,,,,3.0,3.0,3.0,3.0,3.0
4,1.0,1.0,1.0,1.0,1.0,,,,,,3.0,3.0,3.0,3.0,3.0
5,,,,,,2.0,2.0,2.0,2.0,2.0,,,,,
6,,,,,,2.0,2.0,2.0,2.0,2.0,,,,,
7,,,,,,2.0,2.0,2.0,2.0,2.0,,,,,
8,,,,,,2.0,2.0,2.0,2.0,2.0,,,,,
9,,,,,,2.0,2.0,2.0,2.0,2.0,,,,,


In [681]:
emp = pd.read_excel('employees_hr.xls',sheet_name='employees')

In [682]:
dept = pd.read_excel('employees_hr.xls',sheet_name='departments')

In [683]:
emp.head()

Unnamed: 0,emp_id,first_name,last_name,dept_id,salary
0,677509,Lois,Walker,2,51356
1,940761,Brenda,Robinson,2,40887
2,428945,Joe,Robinson,1,50445
3,408351,Diane,Evans,5,41728
4,193819,Benjamin,Russell,5,47202


In [684]:
dept.head()

Unnamed: 0,id,dept_name,dept_location
0,0,Human Resources,USA
1,1,Finance,Europe
2,2,Marketing,USA
3,3,Production,Europe
4,4,Sales,USA


In [686]:
emp.join(dept.set_index('id'), on='dept_id')

Unnamed: 0,emp_id,first_name,last_name,dept_id,salary,dept_name,dept_location
0,677509,Lois,Walker,2,51356,Marketing,USA
1,940761,Brenda,Robinson,2,40887,Marketing,USA
2,428945,Joe,Robinson,1,50445,Finance,Europe
3,408351,Diane,Evans,5,41728,R&D,USA
4,193819,Benjamin,Russell,5,47202,R&D,USA
5,499687,Patrick,Bailey,0,61603,Human Resources,USA
6,539712,Nancy,Baker,4,57919,Sales,USA
7,380086,Carol,Murphy,4,64590,Sales,USA
8,477616,Frances,Young,5,32196,R&D,USA
9,329752,Lillian,Brown,4,60078,Sales,USA


In [687]:
emp.join(dept.set_index('id'),on='dept_id',how='outer')

Unnamed: 0,emp_id,first_name,last_name,dept_id,salary,dept_name,dept_location
5.0,499687.0,Patrick,Bailey,0,61603.0,Human Resources,USA
11.0,621833.0,Gregory,Edwards,0,38068.0,Human Resources,USA
12.0,456747.0,Roy,Griffin,0,54965.0,Human Resources,USA
14.0,333476.0,Mary,Wilson,0,54362.0,Human Resources,USA
2.0,428945.0,Joe,Robinson,1,50445.0,Finance,Europe
13.0,278556.0,Richard,Mitchell,1,78451.0,Finance,Europe
0.0,677509.0,Lois,Walker,2,51356.0,Marketing,USA
1.0,940761.0,Brenda,Robinson,2,40887.0,Marketing,USA
15.0,218791.0,Aaron,Price,3,41690.0,Production,Europe
16.0,134841.0,Donna,Brown,3,44665.0,Production,Europe


In [688]:
emp.join(dept.set_index('id'),on='dept_id',how='left')[['emp_id','dept_name']]

Unnamed: 0,emp_id,dept_name
0,677509,Marketing
1,940761,Marketing
2,428945,Finance
3,408351,R&D
4,193819,R&D
5,499687,Human Resources
6,539712,Sales
7,380086,Sales
8,477616,R&D
9,329752,Sales


In [689]:
pd.merge(left=emp,right=dept,left_on='dept_id',right_on='id',how='inner')

Unnamed: 0,emp_id,first_name,last_name,dept_id,salary,id,dept_name,dept_location
0,677509,Lois,Walker,2,51356,2,Marketing,USA
1,940761,Brenda,Robinson,2,40887,2,Marketing,USA
2,428945,Joe,Robinson,1,50445,1,Finance,Europe
3,408351,Diane,Evans,5,41728,5,R&D,USA
4,193819,Benjamin,Russell,5,47202,5,R&D,USA
5,499687,Patrick,Bailey,0,61603,0,Human Resources,USA
6,539712,Nancy,Baker,4,57919,4,Sales,USA
7,380086,Carol,Murphy,4,64590,4,Sales,USA
8,477616,Frances,Young,5,32196,5,R&D,USA
9,329752,Lillian,Brown,4,60078,4,Sales,USA


In [690]:
pd.merge(left=emp,right=dept,left_on='dept_id',right_on='id',how='right')

Unnamed: 0,emp_id,first_name,last_name,dept_id,salary,id,dept_name,dept_location
0,499687.0,Patrick,Bailey,0.0,61603.0,0,Human Resources,USA
1,621833.0,Gregory,Edwards,0.0,38068.0,0,Human Resources,USA
2,456747.0,Roy,Griffin,0.0,54965.0,0,Human Resources,USA
3,333476.0,Mary,Wilson,0.0,54362.0,0,Human Resources,USA
4,428945.0,Joe,Robinson,1.0,50445.0,1,Finance,Europe
5,278556.0,Richard,Mitchell,1.0,78451.0,1,Finance,Europe
6,677509.0,Lois,Walker,2.0,51356.0,2,Marketing,USA
7,940761.0,Brenda,Robinson,2.0,40887.0,2,Marketing,USA
8,218791.0,Aaron,Price,3.0,41690.0,3,Production,Europe
9,134841.0,Donna,Brown,3.0,44665.0,3,Production,Europe


## Pandas Notes – Windowing Operations (Expanding & Rolling)

---

### Overview

**Windowing operations** allow you to compute aggregations over a sliding or cumulative view of your dataset.  
There are two main types:

- **Expanding**: Calculates a cumulative aggregation from the beginning of the dataset up to the current row.
- **Rolling**: Calculates an aggregation over a defined rolling window size (e.g., past 3 rows).

---

### Dataset Setup

```python
import pandas as pd

# Read CSV
df = pd.read_csv('tfl_daily_bike.csv')

# Convert date column to datetime
df['day'] = pd.to_datetime(df['day'])

# Sort by date
df.sort_values('day', inplace=True)

# Extract Month-Year from date
df['month_year'] = df['day'].dt.strftime('%m-%Y')

# Drop unnamed column (if exists)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

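# Group by Month-Year, aggregate by sum
# sort=False keeps the chronological order established by sort_values above;
# the default would sort the 'MM-YYYY' strings alphabetically (by month, not by date)
monthly = df.groupby('month_year', sort=False)['bicycle_hires'].sum().reset_index()

# Keep the last 12 months
monthly = monthly.tail(12).reset_index(drop=True)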
```

---

### Expanding Aggregation (Cumulative Sum)

```python
monthly['cumulative_total'] = monthly['bicycle_hires'].expanding().sum()
```

- Calculates the running total from the top down.
- Can also use `.mean()`, `.max()`, `.min()`, etc.

---

### Rolling Aggregation (Sliding Window)

```python
# Rolling 3-month total
monthly['rolling_3'] = monthly['bicycle_hires'].rolling(window=3).sum()

# Rolling 6-month total
monthly['rolling_6'] = monthly['bicycle_hires'].rolling(window=6).sum()
```

- You can also use `.mean()`, `.max()`, etc.
- The first `window - 1` rows will be `NaN` because the window is not yet full (by default a result needs `window` observations).
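If the leading `NaN` values are not wanted, `rolling()` accepts a `min_periods` argument. The sketch below uses a small made-up Series (`hires` is illustrative, not the bike data) to contrast the default with `min_periods=1`.

```python
import pandas as pd

hires = pd.Series([10, 20, 30, 40])

# Default: a result needs a full window of 3 values, so the first two are NaN
strict = hires.rolling(window=3).sum()

# min_periods=1: aggregate whatever values the window contains so far
lenient = hires.rolling(window=3, min_periods=1).sum()

print(strict.tolist())   # [nan, nan, 60.0, 90.0]
print(lenient.tolist())  # [10.0, 30.0, 60.0, 90.0]
```

Whether partial windows are acceptable depends on the analysis: a "3-month total" based on one month can be misleading.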

---

### Example Output (Head of DataFrame)

| month_year | bicycle_hires | cumulative_total | rolling_3 | rolling_6 |
|------------|----------------|------------------|-----------|-----------|
| 01-2023    | 500000         | 500000           | NaN       | NaN       |
| 02-2023    | 510000         | 1010000          | NaN       | NaN       |
| 03-2023    | 520000         | 1530000          | 1530000   | NaN       |
| 04-2023    | 530000         | 2060000          | 1560000   | NaN       |
| 05-2023    | 540000         | 2600000          | 1590000   | NaN       |
| 06-2023    | 550000         | 3150000          | 1620000   | 3150000   |

---

### Notes

- `.expanding()` accumulates from the first row to current row.
- `.rolling(window=n)` looks at the current row and `n-1` previous rows.
- Aggregations supported: `sum()`, `mean()`, `std()`, `min()`, `max()`, etc.
- Use `.reset_index()` to restore a default integer index after slicing or grouping if needed.
- Ensure your data is sorted before applying these operations for meaningful results.
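To make the first two notes concrete, here is a small sketch on made-up numbers (`values` is illustrative). It checks that `expanding().sum()` gives the same totals as `cumsum()`, and that a window of 2 is the current row plus one previous row.

```python
import pandas as pd

values = pd.Series([4, 1, 3, 2])

# Expanding sum: first row up to the current row (returned as floats)
running = values.expanding().sum()

# cumsum() produces the same running totals (keeping the integer dtype)
same_as_cumsum = values.cumsum()

# Rolling window of 2: current row plus 1 previous row
window_2 = values.rolling(window=2).sum()

# The same thing written by hand with shift()
manual = values + values.shift(1)

print(running.tolist())   # [4.0, 5.0, 8.0, 10.0]
print(window_2.tolist())  # [nan, 5.0, 4.0, 5.0]
```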

---

### Summary

| Method       | Description                                      | Use Case Example                  |
|--------------|--------------------------------------------------|-----------------------------------|
| `.expanding()` | Cumulative aggregations from top to bottom       | Cumulative revenue, totals        |
| `.rolling(n)`  | Aggregation over a fixed-size rolling window     | Moving average, rolling sum       |

---

```python
# Example: 3-month rolling average of hires
monthly['rolling_avg'] = monthly['bicycle_hires'].rolling(3).mean()
```

---

In [691]:
import pandas as pd
import datetime as dt

In [692]:
tfl_df = pd.read_csv('tfl-daily-cycle-hires.csv')

In [693]:
tfl_df

Unnamed: 0,Day,Number of Bicycle Hires,Unnamed: 2
0,30/07/2010,6897.0,
1,31/07/2010,5564.0,
2,01/08/2010,4303.0,
3,02/08/2010,6642.0,
4,03/08/2010,7966.0,
...,...,...,...
4076,26/09/2021,45120.0,
4077,27/09/2021,32167.0,
4078,28/09/2021,32539.0,
4079,29/09/2021,39889.0,


In [694]:
tfl_df['Day'] = pd.to_datetime(tfl_df['Day'],format='%d/%m/%Y')

In [695]:
tfl_df.sort_values('Day',inplace=True)

In [696]:
tfl_df['Day']=tfl_df['Day'].dt.strftime('%Y-%m')

In [697]:
tfl_df.rename({'Day':'Month-Year'},axis=1,inplace=True)

In [698]:
tfl_df.drop('Unnamed: 2',axis=1,inplace=True)

In [699]:
tfl_df

Unnamed: 0,Month-Year,Number of Bicycle Hires
0,2010-07,6897.0
1,2010-07,5564.0
2,2010-08,4303.0
3,2010-08,6642.0
4,2010-08,7966.0
...,...,...
4076,2021-09,45120.0
4077,2021-09,32167.0
4078,2021-09,32539.0
4079,2021-09,39889.0


In [700]:
tfl_12_month = tfl_df.groupby(by='Month-Year').sum().tail(12).reset_index()

In [701]:
tfl_12_month

Unnamed: 0,Month-Year,Number of Bicycle Hires
0,2020-10,848233.0
1,2020-11,760245.0
2,2020-12,589091.0
3,2021-01,409761.0
4,2021-02,510806.0
5,2021-03,748233.0
6,2021-04,943328.0
7,2021-05,921413.0
8,2021-06,1183119.0
9,2021-07,1167625.0


In [702]:
tfl_12_month['Number of Bicycle Hires'].expanding().sum()

Unnamed: 0,Number of Bicycle Hires
0,848233.0
1,1608478.0
2,2197569.0
3,2607330.0
4,3118136.0
5,3866369.0
6,4809697.0
7,5731110.0
8,6914229.0
9,8081854.0


In [708]:
tfl_12_month['Cumulative Sum'] = tfl_12_month['Number of Bicycle Hires'].expanding().sum()

In [709]:
tfl_12_month

Unnamed: 0,Month-Year,Number of Bicycle Hires,Cumulative_Sum,Rolling Avg 3 months,Rolling Avg 6 months,Cumulative Sum
0,2020-10,848233.0,848233.0,,,848233.0
1,2020-11,760245.0,1608478.0,,,1608478.0
2,2020-12,589091.0,2197569.0,2197569.0,,2197569.0
3,2021-01,409761.0,2607330.0,1759097.0,,2607330.0
4,2021-02,510806.0,3118136.0,1509658.0,,3118136.0
5,2021-03,748233.0,3866369.0,1668800.0,3866369.0,3866369.0
6,2021-04,943328.0,4809697.0,2202367.0,3961464.0,4809697.0
7,2021-05,921413.0,5731110.0,2612974.0,4122632.0,5731110.0
8,2021-06,1183119.0,6914229.0,3047860.0,4716660.0,6914229.0
9,2021-07,1167625.0,8081854.0,3272157.0,5474524.0,8081854.0


In [710]:
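# Note: despite "Avg" in the column name, this stores a 3-month rolling sum; use .mean() for an average
tfl_12_month['Rolling Avg 3 months'] = tfl_12_month['Number of Bicycle Hires'].rolling(3).sum()

In [711]:
# Note: likewise a 6-month rolling sum, not an average
tfl_12_month['Rolling Avg 6 months'] = tfl_12_month['Number of Bicycle Hires'].rolling(6).sum()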

In [712]:
tfl_12_month

Unnamed: 0,Month-Year,Number of Bicycle Hires,Cumulative_Sum,Rolling Avg 3 months,Rolling Avg 6 months,Cumulative Sum
0,2020-10,848233.0,848233.0,,,848233.0
1,2020-11,760245.0,1608478.0,,,1608478.0
2,2020-12,589091.0,2197569.0,2197569.0,,2197569.0
3,2021-01,409761.0,2607330.0,1759097.0,,2607330.0
4,2021-02,510806.0,3118136.0,1509658.0,,3118136.0
5,2021-03,748233.0,3866369.0,1668800.0,3866369.0,3866369.0
6,2021-04,943328.0,4809697.0,2202367.0,3961464.0,4809697.0
7,2021-05,921413.0,5731110.0,2612974.0,4122632.0,5731110.0
8,2021-06,1183119.0,6914229.0,3047860.0,4716660.0,6914229.0
9,2021-07,1167625.0,8081854.0,3272157.0,5474524.0,8081854.0
