# Pandas Exercises (Solution)

## Overview

This module covers essential Pandas operations, including data manipulation, analysis, and basic statistical functions. It provides hands-on experience with real-world data using the Pandas library.

## Learning Objectives

- Convert lists of dictionaries and CSV files to DataFrames
- Perform data access operations using Pandas
- Handle missing data with the `fillna` function
- Apply descriptive statistics functions to analyze data
- Utilize Pandas for data slicing and dicing

## Prerequisites

- Basic understanding of Python
- Familiarity with Jupyter notebooks
- Installed libraries: numpy, pandas

## Get Started

### Import necessary libraries

In [None]:
# Importing the numpy library, which provides support for large, multi-dimensional arrays and matrices
# It also provides mathematical functions to operate on these arrays
import numpy as np

# Importing the pandas library, which is a powerful, flexible, and easy-to-use data manipulation and analysis tool
# It provides data structures such as DataFrame and Series to handle and analyze structured data
import pandas as pd

## Convert list of dictionaries to DataFrame

In [None]:
# List of dictionaries containing city names and associated data values
d = [
    # First city: Delhi with associated data value 1000
    {"city": "Delhi", "data": 1000},
    
    # Second city: Bangalore with associated data value 2000
    {"city": "Bangalore", "data": 2000},
    
    # Third city: Mumbai with associated data value 1000
    {"city": "Mumbai", "data": 1000},
]

# Output the list of dictionaries
d

Convert the list of dictionaries `d` into a DataFrame.

In [None]:
# Create a pandas DataFrame from the list of dictionaries 'd'
df = pd.DataFrame(d)

# Display the DataFrame
df
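
As a small illustration of how the conversion treats dictionary keys, the sketch below uses hypothetical records (not part of the exercise data) in which one dictionary lacks a key. The assumption being shown is that each distinct key becomes a column and a missing key becomes `NaN`.

```python
import pandas as pd

# Hypothetical records; the second one has no "data" key
records = [
    {"city": "Delhi", "data": 1000},
    {"city": "Bangalore"},
]

frame = pd.DataFrame(records)

# Column names come from the dictionary keys
print(list(frame.columns))

# The key missing from the second record shows up as NaN
print(frame["data"].isna().tolist())
```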

## Convert CSV files to DataFrame

Read in a CSV file and convert it to a DataFrame.

In [None]:
# Read the CSV file containing city data into a pandas DataFrame
# The path is relative to the current working directory
city_data = pd.read_csv("../../Data/simplemaps-worldcities-basic.csv")
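
If the CSV file is not available locally, the same call can be tried on in-memory text. This is only a sketch: the CSV content and the name `sample` are made up for illustration, and the column names merely mirror the ones used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "city,lat,lng,pop\nA,1.5,2.5,100\nB,-3.0,4.0,200\n"

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred column types and the (rows, columns) shape
print(sample.dtypes)
print(sample.shape)
```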

Show the first 10 rows of the converted DataFrame.

In [None]:
# Display the first 10 rows of the 'city_data' DataFrame
city_data.head(n=10)

## Data Access

### Head and Tail

Get the last 10 rows of `city_data`:

In [None]:
# The .tail(10) method retrieves the last 10 rows of the DataFrame
# This is useful for quickly inspecting the end of the dataset
city_data.tail(10)

### Slicing and Dicing

In [None]:
# Extract the 'lat' column from the DataFrame as a Pandas Series
# This creates a one-dimensional labeled array with the latitude values
series_es = city_data.lat

# Use the type() function to determine the data type of 'series_es'
# This will return <class 'pandas.core.series.Series'>, confirming it's a Pandas Series
type(series_es)

Get the first 5 odd-numbered rows of `series_es` (positions 1, 3, 5, 7, and 9):

In [None]:
# Slice the `series_es` object to extract elements from index 1 to 10 (exclusive) with a step of 2
# This means selecting every second element starting from index 1 up to (but not including) index 10
series_es[1:10:2]
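
The slice above relies on the default integer index. As a hedged sketch on a hypothetical Series with string labels, `.iloc` makes the positional intent explicit regardless of what the index contains.

```python
import pandas as pd

# Hypothetical Series with string labels instead of the default 0..n-1 index
s = pd.Series([10, 20, 30, 40, 50, 60], index=list("abcdef"))

# start=1, stop=6 (exclusive), step=2 -> positions 1, 3, 5
odd_rows = s.iloc[1:6:2]

print(odd_rows.tolist())
print(odd_rows.index.tolist())
```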

Get the first 8 rows of `series_es` using Python list slicing:

In [None]:
# Slice the `series_es` object to extract elements from the beginning up to index 8 (exclusive)
# This selects all elements from index 0 to index 7
series_es[:8]

Get the first 8 rows of `city_data` using Python list slicing:

In [None]:
# Slice the `city_data` DataFrame to extract the first 8 rows
# This selects all rows from position 0 to position 7 (inclusive)
city_data[:8]

Get the first 4 columns of the first 5 rows of **city_data**:

In [None]:
# Use `.iloc` to slice the `city_data` DataFrame
# `.iloc` is used for integer-location based indexing
# `:5` selects the first 5 rows (indices 0 to 4)
# `:4` selects the first 4 columns (indices 0 to 3)
city_data.iloc[:5, :4]
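
One detail worth keeping in mind is that `.iloc` slices exclude the end position, while label-based `.loc` slices include the end label. The sketch below shows the difference on a hypothetical three-column frame.

```python
import pandas as pd

# Hypothetical frame with the default integer row labels 0..3
frame = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]}
)

# Positional: rows 0-1 and columns 0-1 (end positions excluded)
by_position = frame.iloc[:2, :2]

# Label-based: rows labelled 0-2 and columns "a" through "b" (end labels included)
by_label = frame.loc[:2, "a":"b"]

print(by_position.shape)
print(by_label.shape)
```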

Select the cities with a population of more than 10 million, keeping only the columns whose names start with the letter `p`:

In [None]:
# Filter the city_data dataframe to include cities with a population greater than 10 million
# The condition city_data["pop"] > 10000000 selects rows where the population exceeds 10 million
# Then, select only the columns whose names start with "p"
# This is achieved using the str.startswith("p") function on the column names
city_data[city_data["pop"] > 10000000][
    city_data.columns[pd.Series(city_data.columns).str.startswith("p")]
]
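
An alternative way to express the same selection is a single `.loc` call with one boolean mask for rows and one for columns. This is a sketch on hypothetical data; the frame `cities` and its `province` column are invented for illustration.

```python
import pandas as pd

# Hypothetical data with two columns whose names start with "p"
cities = pd.DataFrame(
    {
        "city": ["A", "B", "C"],
        "pop": [12_000_000, 500_000, 15_000_000],
        "province": ["P1", "P2", "P3"],
    }
)

# Boolean mask over rows: population above 10 million
row_mask = cities["pop"] > 10_000_000

# Boolean mask over columns: names starting with "p"
col_mask = cities.columns.str.startswith("p")

# Apply both masks in one indexing step
big = cities.loc[row_mask, col_mask]
print(big)
```

Doing the row and column selection in one step avoids building an intermediate DataFrame.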

## Data Operations

### Missing data and the `fillna` function

In [None]:
# Create a DataFrame with 8 rows and 3 columns, filled with random numbers from a normal distribution
df = pd.DataFrame(np.random.randn(8, 3), columns=["A", "B", "C"])

# Set a specific value (at row 4, column 'C') to NaN (missing value)
df.iloc[4, 2] = np.nan

# Display the DataFrame
df

Replace every `NaN` value in `df` with `0`:

In [None]:
# Replace all missing values (NaN) in the DataFrame `df` with 0
# `fillna(0)` fills NaN values with the specified value (in this case, 0)
df_filled = df.fillna(0)

# Display the DataFrame with missing values replaced
df_filled
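
`fillna` also accepts a dictionary that maps column names to fill values, and it returns a new DataFrame rather than modifying the original. The following sketch uses a hypothetical two-column frame and fills one column with a constant and the other with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
frame = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

# Fill column A with 0.0 and column B with the mean of its non-missing values
filled = frame.fillna({"A": 0.0, "B": frame["B"].mean()})

print(filled)

# The original frame still contains its NaN values
print(frame.isna().sum().sum())
```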

## Descriptive Statistics functions

In [None]:
# Define a list of column names that contain numeric data
# This list can be used to select or manipulate specific columns in a DataFrame
columns_numeric = ["lat", "lng", "pop"]

Get the average `lat`, `lng`, and `pop` values of `city_data`:

In [None]:
# Calculate the mean (average) of the numeric columns in `city_data` specified by `columns_numeric`
# `columns_numeric` is a list of column names containing numeric data (e.g., ["lat", "lng", "pop"])
# The `.mean()` function computes the mean for each column
mean_values = city_data[columns_numeric].mean()

# Display the mean values for the specified columns
mean_values

Get the sum of the `lat`, `lng`, and `pop` values of `city_data`:

In [None]:
# Calculate the sum of the numeric columns in `city_data` specified by `columns_numeric`
# `columns_numeric` is a list of column names containing numeric data (e.g., ["lat", "lng", "pop"])
# The `.sum()` function computes the sum for each column
sum_values = city_data[columns_numeric].sum()

# Display the sum values for the specified columns
sum_values

Get the number of non-missing `lat`, `lng`, and `pop` values in `city_data`:

In [None]:
# Count the number of non-missing (non-NaN) values in the numeric columns of `city_data` specified by `columns_numeric`
# `columns_numeric` is a list of column names containing numeric data (e.g., ["lat", "lng", "pop"])
# The `.count()` function counts the number of non-missing values for each column
count_values = city_data[columns_numeric].count()

# Display the count of non-missing values for the specified columns
count_values
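
Because `.count()` skips missing values, it can differ from the number of rows. The sketch below illustrates this on a hypothetical frame with one `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "lat" has one missing value, "pop" has none
frame = pd.DataFrame({"lat": [1.0, np.nan, 3.0], "pop": [10, 20, 30]})

# Non-missing values per column
counts = frame.count()

# Total number of rows, regardless of missing values
n_rows = len(frame)

print(counts)
print(n_rows)
```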

Get the 75th percentile of the `lat`, `lng`, and `pop` values of `city_data`:

In [None]:
# Calculate the 75th percentile (third quartile) for the numeric columns in `city_data` specified by `columns_numeric`
# `columns_numeric` is a list of column names containing numeric data (e.g., ["lat", "lng", "pop"])
# The `.quantile(0.75)` function computes the 75th percentile for each column
quantile_values = city_data[columns_numeric].quantile(0.75)

# Display the 75th percentile values for the specified columns
quantile_values
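
To make the percentile computation concrete, here is a sketch on a hypothetical four-value Series. It assumes the default linear interpolation: the 0.75 quantile sits at position 0.75 * (4 - 1) = 2.25, a quarter of the way between the values at positions 2 and 3.

```python
import pandas as pd

# Hypothetical, already sorted values
values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default interpolation is linear between neighbouring values
q75 = values.quantile(0.75)

# Manual calculation: value at position 2 plus 0.25 of the gap to position 3
manual = 3.0 + 0.25 * (4.0 - 3.0)

print(q75, manual)
```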

Get the row-wise sums of the numeric columns of `city_data`:

In [None]:
# Calculate the sum of numeric columns in `city_data` specified by `columns_numeric` row-wise (across columns)
# `columns_numeric` is a list of column names containing numeric data (e.g., ["lat", "lng", "pop"])
# The `.sum(axis=1)` function adds the values across the columns, giving one sum per row
row_sums = city_data[columns_numeric].sum(axis=1)

# Display the sum of numeric values for each row
row_sums
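
The `axis` argument is easy to mix up, so the sketch below contrasts both directions on a hypothetical two-by-two frame.

```python
import pandas as pd

# Hypothetical frame with two rows and two columns
frame = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

# axis=0 (the default): one sum per column, indexed by column name
column_sums = frame.sum(axis=0)

# axis=1: one sum per row, indexed by row label
row_totals = frame.sum(axis=1)

print(column_sums)
print(row_totals)
```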

Calculate the most important statistics for numerical data in `city_data` in one go so that we don’t have to use individual functions:

In [None]:
# Generate descriptive statistics for the numeric columns in `city_data` specified by `columns_numeric`
# `columns_numeric` is a list of column names containing numeric data (e.g., ["lat", "lng", "pop"])
# The `.describe()` function computes summary statistics for each column
summary_statistics = city_data[columns_numeric].describe()

# Display the summary statistics for the specified columns
summary_statistics
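
The rows of the `describe()` output correspond to the individual functions used earlier. As a sketch on a hypothetical single-column frame, each row can be looked up by its label and compared with the matching method.

```python
import pandas as pd

# Hypothetical numeric data
frame = pd.DataFrame({"pop": [100.0, 200.0, 300.0, 400.0]})

summary = frame.describe()

# Row labels produced for numeric columns
print(summary.index.tolist())

# Individual rows match the stand-alone methods
print(summary.loc["mean", "pop"], frame["pop"].mean())
print(summary.loc["75%", "pop"], frame["pop"].quantile(0.75))
```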

## Conclusion

In this module, you've learned how to:

- Convert different data formats to Pandas DataFrames
- Access and manipulate data using Pandas
- Handle missing data
- Perform basic statistical analysis on datasets
- Use various Pandas functions for data exploration and manipulation

These skills form a foundation for more advanced data analysis and machine learning tasks using Python and Pandas.

## Clean up

Remember to shut down your Jupyter notebook kernel when you're done to free up resources.