# Pandas Exercises

## Overview

This module covers essential Pandas operations, including data manipulation, analysis, and basic statistical functions. It provides hands-on experience with real-world data using the Pandas library.

## Learning Objectives

- Convert a list of dictionaries and CSV files to DataFrames
- Perform data access operations using Pandas
- Handle missing data with the `fillna` function
- Apply descriptive statistics functions to analyze data
- Utilize Pandas for data slicing and dicing

## Prerequisites

- Basic understanding of Python
- Familiarity with Jupyter notebooks
- Installed libraries: numpy, pandas

## Get Started

### Install required packages

In [1]:
%pip install numpy pandas

Note: you may need to restart the kernel to use updated packages.


### Import necessary libraries

In [2]:
import numpy as np
import pandas as pd

## Convert list of dictionaries to DataFrame

In [3]:
d = [
    {"city": "Delhi", "data": 1000},
    {"city": "Bangalore", "data": 2000},
    {"city": "Mumbai", "data": 1000},
]
d

[{'city': 'Delhi', 'data': 1000},
 {'city': 'Bangalore', 'data': 2000},
 {'city': 'Mumbai', 'data': 1000}]

Convert the list of dictionaries `d` into a DataFrame.

In [4]:
df = pd.DataFrame(d)
df

| | city | data |
|---|---|---|
| 0 | Delhi | 1000 |
| 1 | Bangalore | 2000 |
| 2 | Mumbai | 1000 |
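
When the dictionaries do not all share the same keys, `pd.DataFrame` uses the union of the keys as columns and fills the gaps with `NaN`. A minimal sketch, using a hypothetical `records` list that is not part of the exercise data:

```python
import pandas as pd

# Hypothetical records: the second dictionary has no "data" key.
records = [
    {"city": "Delhi", "data": 1000},
    {"city": "Bangalore"},
]

frame = pd.DataFrame(records)

# Columns are the union of all keys; the missing entry becomes NaN.
print(frame.columns.tolist())
print(frame["data"].isna().tolist())
```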


## Convert CSV files to DataFrame

Read a CSV file into a DataFrame.

In [5]:
city_data = pd.read_csv("../../Data/simplemaps-worldcities-basic.csv")

Show the first 10 rows of the resulting DataFrame.

In [6]:
city_data.head(n=10)

| | city | city_ascii | lat | lng | pop | country | iso2 | iso3 | province |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Qal eh-ye Now | Qal eh-ye | 34.983 | 63.1333 | 2997.0 | Afghanistan | AF | AFG | Badghis |
| 1 | Chaghcharan | Chaghcharan | 34.516701 | 65.250001 | 15000.0 | Afghanistan | AF | AFG | Ghor |
| 2 | Lashkar Gah | Lashkar Gah | 31.582998 | 64.36 | 201546.0 | Afghanistan | AF | AFG | Hilmand |
| 3 | Zaranj | Zaranj | 31.112001 | 61.886998 | 49851.0 | Afghanistan | AF | AFG | Nimroz |
| 4 | Tarin Kowt | Tarin Kowt | 32.633298 | 65.866699 | 10000.0 | Afghanistan | AF | AFG | Uruzgan |
| 5 | Zareh Sharan | Zareh Sharan | 32.85 | 68.416705 | 13737.0 | Afghanistan | AF | AFG | Paktika |
| 6 | Asadabad | Asadabad | 34.866 | 71.150005 | 48400.0 | Afghanistan | AF | AFG | Kunar |
| 7 | Taloqan | Taloqan | 36.729999 | 69.540004 | 64256.0 | Afghanistan | AF | AFG | Takhar |
| 8 | Mahmud-E Eraqi | Mahmud-E Eraqi | 35.016696 | 69.333301 | 7407.0 | Afghanistan | AF | AFG | Kapisa |
| 9 | Mehtar Lam | Mehtar Lam | 34.65 | 70.166701 | 17345.0 | Afghanistan | AF | AFG | Laghman |
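
`pd.read_csv` also accepts file-like objects and can restrict what it loads. The sketch below reads a small in-memory CSV (the `csv_text` content is made up for illustration and stands in for a file on disk) and uses the `usecols` and `nrows` parameters to load only two columns and two rows:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = (
    "city,lat,lng,pop\n"
    "A,1.0,2.0,100\n"
    "B,3.0,4.0,200\n"
    "C,5.0,6.0,300\n"
)

# usecols limits the columns that are parsed; nrows limits the rows read.
sample = pd.read_csv(io.StringIO(csv_text), usecols=["city", "pop"], nrows=2)
print(sample)
```

Limiting columns and rows this way can be useful for a quick look at a large file before loading it in full.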


## Data Access

### Head and Tail

Get the last 10 rows of `city_data`:

In [7]:
city_data.tail(10)

| | city | city_ascii | lat | lng | pop | country | iso2 | iso3 | province |
|---|---|---|---|---|---|---|---|---|---|
| 7312 | Karoi | Karoi | -16.819556 | 29.679987 | 13194.0 | Zimbabwe | ZW | ZWE | Mashonaland West |
| 7313 | Chinhoyi | Chinhoyi | -17.359626 | 30.180008 | 52812.0 | Zimbabwe | ZW | ZWE | Mashonaland West |
| 7314 | Kariba | Kariba | -16.5296 | 28.80004 | 23133.5 | Zimbabwe | ZW | ZWE | Mashonaland West |
| 7315 | Hwange | Hwange | -18.370004 | 26.500026 | 33599.5 | Zimbabwe | ZW | ZWE | Matabeleland North |
| 7316 | Gweru | Gweru | -19.450041 | 29.82003 | 164715.5 | Zimbabwe | ZW | ZWE | Midlands |
| 7317 | Mutare | Mutare | -18.970019 | 32.650038 | 216785.0 | Zimbabwe | ZW | ZWE | Manicaland |
| 7318 | Kadoma | Kadoma | -18.330006 | 29.909947 | 56400.0 | Zimbabwe | ZW | ZWE | Mashonaland West |
| 7319 | Chitungwiza | Chitungwiza | -18.000001 | 31.100003 | 331071.0 | Zimbabwe | ZW | ZWE | Harare |
| 7320 | Harare | Harare | -17.81779 | 31.044709 | 1557406.5 | Zimbabwe | ZW | ZWE | Harare |
| 7321 | Bulawayo | Bulawayo | -20.169998 | 28.580002 | 697096.0 | Zimbabwe | ZW | ZWE | Bulawayo |


### Slicing and Dicing

In [8]:
series_es = city_data.lat
type(series_es)

pandas.core.series.Series

Get the first 5 odd-indexed rows of `series_es` (positions 1, 3, 5, 7, 9):

In [9]:
series_es[1:10:2]

1    34.516701
3    31.112001
5    32.850000
7    36.729999
9    34.650000
Name: lat, dtype: float64
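
The slice `1:10:2` above selects by position. On a Series whose index is not the default integer range, `.iloc` makes the positional intent explicit, while `.loc` slices by label and includes the end label. A minimal sketch with a hypothetical Series `s`:

```python
import pandas as pd

# Hypothetical Series with string labels instead of the default 0..n-1 index.
s = pd.Series([10, 20, 30, 40, 50, 60], index=list("abcdef"))

# Positional slicing: every second element starting at position 1.
odd_rows = s.iloc[1::2]

# Label slicing: both endpoints "b" and "d" are included.
by_label = s.loc["b":"d"]

print(odd_rows)
print(by_label)
```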

Get the first 8 rows of `series_es`:

In [10]:
series_es[:8]

0    34.983000
1    34.516701
2    31.582998
3    31.112001
4    32.633298
5    32.850000
6    34.866000
7    36.729999
Name: lat, dtype: float64

Get the first 8 rows of `city_data`:

In [11]:
city_data[:8]

| | city | city_ascii | lat | lng | pop | country | iso2 | iso3 | province |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Qal eh-ye Now | Qal eh-ye | 34.983 | 63.1333 | 2997.0 | Afghanistan | AF | AFG | Badghis |
| 1 | Chaghcharan | Chaghcharan | 34.516701 | 65.250001 | 15000.0 | Afghanistan | AF | AFG | Ghor |
| 2 | Lashkar Gah | Lashkar Gah | 31.582998 | 64.36 | 201546.0 | Afghanistan | AF | AFG | Hilmand |
| 3 | Zaranj | Zaranj | 31.112001 | 61.886998 | 49851.0 | Afghanistan | AF | AFG | Nimroz |
| 4 | Tarin Kowt | Tarin Kowt | 32.633298 | 65.866699 | 10000.0 | Afghanistan | AF | AFG | Uruzgan |
| 5 | Zareh Sharan | Zareh Sharan | 32.85 | 68.416705 | 13737.0 | Afghanistan | AF | AFG | Paktika |
| 6 | Asadabad | Asadabad | 34.866 | 71.150005 | 48400.0 | Afghanistan | AF | AFG | Kunar |
| 7 | Taloqan | Taloqan | 36.729999 | 69.540004 | 64256.0 | Afghanistan | AF | AFG | Takhar |


Get the first 4 columns of the first 5 rows of `city_data`:

In [12]:
city_data.iloc[:5, :4]

| | city | city_ascii | lat | lng |
|---|---|---|---|---|
| 0 | Qal eh-ye Now | Qal eh-ye | 34.983 | 63.1333 |
| 1 | Chaghcharan | Chaghcharan | 34.516701 | 65.250001 |
| 2 | Lashkar Gah | Lashkar Gah | 31.582998 | 64.36 |
| 3 | Zaranj | Zaranj | 31.112001 | 61.886998 |
| 4 | Tarin Kowt | Tarin Kowt | 32.633298 | 65.866699 |


Select cities with a population above 10 million, keeping only the columns whose names start with the letter `p`:

In [13]:
city_data.loc[
    city_data["pop"] > 10_000_000,
    city_data.columns.str.startswith("p"),
]

| | pop | province |
|---|---|---|
| 360 | 11862073.0 | Ciudad de Buenos Aires |
| 1171 | 14433147.5 | São Paulo |
| 2068 | 14797756.0 | Shanghai |
| 3098 | 11779606.5 | Delhi |
| 3110 | 15834918.0 | Maharashtra |
| 3492 | 22006299.5 | Tokyo |
| 4074 | 14919501.0 | Distrito Federal |
| 4513 | 11877109.5 | Sind |
| 5394 | 10452000.0 | Moskva |
| 6124 | 10003305.0 | Istanbul |
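
Several row conditions can be combined into one boolean mask with `&` (and) or `|` (or); each comparison needs its own parentheses because of operator precedence. A sketch on a hypothetical `cities` frame with made-up values:

```python
import pandas as pd

# Hypothetical data, not taken from the world-cities CSV.
cities = pd.DataFrame(
    {
        "city": ["A", "B", "C", "D"],
        "country": ["X", "X", "Y", "Y"],
        "pop": [12_000_000, 500_000, 11_000_000, 9_000_000],
    }
)

# Rows must satisfy both conditions; parentheses around each are required.
mask = (cities["pop"] > 10_000_000) & (cities["country"] == "Y")

# .loc takes the row mask first and the column selection second.
selected = cities.loc[mask, ["city", "pop"]]
print(selected)
```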


## Data Operations

### Missing data and the `fillna` function

In [14]:
df = pd.DataFrame(np.random.randn(8, 3), columns=["A", "B", "C"])
df.iloc[4, 2] = np.nan
df

| | A | B | C |
|---|---|---|---|
| 0 | 0.793796 | 0.548567 | -0.510938 |
| 1 | 0.390598 | 0.309418 | -0.724149 |
| 2 | 0.22363 | 0.349096 | 0.981475 |
| 3 | 0.939356 | 1.134944 | -0.832401 |
| 4 | 1.18932 | 0.915207 | NaN |
| 5 | -0.515254 | 3.000557 | -1.082148 |
| 6 | -0.631578 | -0.430122 | -0.713525 |
| 7 | 1.10708 | -0.650253 | 0.655335 |


Replace every `NaN` in `df` with `0`. `fillna` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back:

In [15]:
df.fillna(0)

| | A | B | C |
|---|---|---|---|
| 0 | 0.793796 | 0.548567 | -0.510938 |
| 1 | 0.390598 | 0.309418 | -0.724149 |
| 2 | 0.22363 | 0.349096 | 0.981475 |
| 3 | 0.939356 | 1.134944 | -0.832401 |
| 4 | 1.18932 | 0.915207 | 0.0 |
| 5 | -0.515254 | 3.000557 | -1.082148 |
| 6 | -0.631578 | -0.430122 | -0.713525 |
| 7 | 1.10708 | -0.650253 | 0.655335 |
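
A constant `0` is not always the best replacement. `fillna` also accepts a dictionary that maps column names to fill values, so each column can get its own default. A minimal sketch on a hypothetical `frame` (the choice of the column mean for `A` and `-1.0` for `B` is only an illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
frame = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

# Fill column A with its mean and column B with a fixed value.
by_column = frame.fillna({"A": frame["A"].mean(), "B": -1.0})
print(by_column)

# The original frame still contains its missing values.
print(frame.isna().sum().sum())
```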


## Descriptive Statistics functions

In [16]:
columns_numeric = ["lat", "lng", "pop"]

Get average `lat`, `lng`, and `pop` values:

In [17]:
city_data[columns_numeric].mean()

lat        20.662876
lng        10.711914
pop    265463.071633
dtype: float64

Get sum of `lat`, `lng`, and `pop` values:

In [18]:
city_data[columns_numeric].sum()

lat    1.512936e+05
lng    7.843263e+04
pop    1.943721e+09
dtype: float64

Get the number of non-missing `lat`, `lng`, and `pop` values:

In [19]:
city_data[columns_numeric].count()

lat    7322
lng    7322
pop    7322
dtype: int64

Get the 75th percentile of `lat`, `lng`, and `pop` values:

In [20]:
city_data[columns_numeric].quantile(0.75)

lat        43.575448
lng        73.103628
pop    200172.625000
Name: 0.75, dtype: float64
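
`quantile` also accepts a list of fractions and then returns one row per requested quantile. A short sketch on a hypothetical single-column frame, using the default linear interpolation:

```python
import pandas as pd

# Hypothetical values chosen so the quartiles fall on actual data points.
values = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# One row per quantile; the index holds the requested fractions.
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)
```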

Get the sum of each row (`axis=1` adds the values across columns):

In [21]:
city_data[columns_numeric].sum(axis=1)

0       3.095116e+03
1       1.509977e+04
2       2.016419e+05
3       4.994400e+04
4       1.009850e+04
            ...     
7317    2.167987e+05
7318    5.641158e+04
7319    3.310841e+05
7320    1.557420e+06
7321    6.971044e+05
Length: 7322, dtype: float64
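
The `axis` argument decides the direction of the reduction: the default `axis=0` collapses the rows and gives one result per column, while `axis=1` collapses the columns and gives one result per row. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame small enough to check the sums by hand.
frame = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Default axis=0: one sum per column.
column_sums = frame.sum()

# axis=1: one sum per row.
row_sums = frame.sum(axis=1)

print(column_sums)
print(row_sums)
```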

Calculate the most important statistics for the numeric columns in one call, instead of calling each function individually:

In [22]:
city_data[columns_numeric].describe()

| | lat | lng | pop |
|---|---|---|---|
| count | 7322.0 | 7322.0 | 7322.0 |
| mean | 20.662876 | 10.711914 | 265463.1 |
| std | 29.134818 | 79.044615 | 828762.2 |
| min | -89.982894 | -179.589979 | -99.0 |
| 25% | -0.32471 | -64.788472 | 17344.25 |
| 50% | 26.79273 | 18.617509 | 61322.75 |
| 75% | 43.575448 | 73.103628 | 200172.6 |
| max | 82.483323 | 179.383304 | 22006300.0 |
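
The minimum `pop` of `-99.0` in the summary above cannot be a real population, so it may be a placeholder for missing data. That is an assumption; the dataset itself is not documented here. If it holds, replacing the placeholder with `NaN` before summarizing keeps it out of the statistics, since `describe` skips missing values. A sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical population values; -99.0 is assumed to mean "unknown".
pop = pd.Series([-99.0, 1000.0, 3000.0], name="pop")

# Turn the assumed placeholder into a proper missing value.
cleaned = pop.replace(-99.0, np.nan)

raw_summary = pop.describe()
clean_summary = cleaned.describe()

# The placeholder no longer counts as an observation or drags the minimum down.
print(raw_summary["min"], clean_summary["min"])
print(raw_summary["count"], clean_summary["count"])
```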


## Conclusion

In this module, you've learned how to:

- Convert different data formats to Pandas DataFrames
- Access and manipulate data using Pandas
- Handle missing data
- Perform basic statistical analysis on datasets
- Use various Pandas functions for data exploration and manipulation

These skills form a foundation for more advanced data analysis and machine learning tasks using Python and Pandas.

## Clean up

Remember to shut down your Jupyter notebook kernel when you're done to free up resources.