<a href="https://colab.research.google.com/github/sprince0031/ICT-Python-ML/blob/main/Week%203/Notebooks/Week3.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python & ML Foundations: Session 3 Solutions
## NumPy & Pandas

Welcome to the Session 3 tutorial and companion notebook! This week, we'll learn to perform basic mathematical operations on n-dimensional arrays using NumPy and pre-process a real-world housing dataset that's available in Colab's `sample_data` folder.

## Utility code
The code cell below contains common imports and sample data that can be useful for your exercises. Make sure to run this cell before starting your exercises!

In [None]:
import pandas as pd
import numpy as np

---
# Video Challenges

## 1. NumPy

This assignment tests your ability to use NumPy to prepare a dataset for a machine learning task. Use `np.random.seed(42)` to ensure your results are reproducible.

### Create a Dataset
Create a 10x4 NumPy array representing 10 houses with the following columns:
* **Square Footage:** Random integers from `1000` to `3000`
* **Bedrooms:** Random integers from `2` to `6`
* **House Age:** Random integers from `1` to `50`
* **House Price:** Random floats from `150.0` to `750.0`

Print the resulting array.

### Filter the Data
Using boolean indexing, create a new array containing only the houses with more than 3 bedrooms and an age of less than 20 years.

Print the filtered array and its `.shape`.


In [None]:
# Seed the generator so the results are reproducible
np.random.seed(42)

# Create 10x1 NumPy arrays for each of the columns
# (randint excludes the upper bound, hence the +1 on each high value)
sq_footage = np.random.randint(1000, 3001, size=(10, 1))
bedrooms = np.random.randint(2, 7, size=(10, 1))
house_age = np.random.randint(1, 51, size=(10, 1))
house_price = np.random.uniform(150.0, 750.0, size=(10, 1))

# Collate the arrays into a single 10x4 array
house_data = np.hstack((sq_footage, bedrooms, house_age, house_price))

print(f'Full house data:\n{house_data}')
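
One side effect worth noticing: `np.hstack` returns an array with a single dtype, so the integer columns are upcast to `float64` once they sit next to the float price column. A minimal sketch with made-up values (the names `ints`, `floats`, and `combined` are illustrative only and not part of the exercise):

```python
import numpy as np

# Two small illustrative columns: one integer, one float
ints = np.array([[1200], [2500]])
floats = np.array([[310.5], [642.0]])

# Stacking side by side forces a common dtype for the whole array
combined = np.hstack((ints, floats))

print(ints.dtype)      # an integer dtype (exact width is platform dependent)
print(combined.dtype)  # float64
print(combined)
```

This is why the square footage, bedroom, and age values print with a decimal point in the full array.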

In [None]:
# Filter data
more_than_3_bedrooms = house_data[:,1] > 3
age_less_than_20 = house_data[:,2] < 20

combined_filter = more_than_3_bedrooms & age_less_than_20

filtered_houses = house_data[combined_filter]

# Print filtered numpy array and its shape
print(f'Filtered house data:\n{filtered_houses}')
print(f'Shape of filtered data: {filtered_houses.shape}')
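
The two masks can also be combined in a single expression, as long as each comparison is wrapped in parentheses, because `&` binds more tightly than `>` and `<`. A small sketch on hypothetical rows (the array `houses` is made up for illustration):

```python
import numpy as np

# Hypothetical rows: [square footage, bedrooms, age, price]
houses = np.array([
    [1500.0, 4.0, 10.0, 300.0],
    [2200.0, 2.0, 5.0, 450.0],
    [1800.0, 5.0, 35.0, 520.0],
    [2600.0, 6.0, 19.0, 700.0],
])

# More than 3 bedrooms AND younger than 20 years, in one expression
mask = (houses[:, 1] > 3) & (houses[:, 2] < 20)
selected = houses[mask]

print(mask)            # [ True False False  True]
print(selected.shape)  # (2, 4)
```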

## 2. Pandas

**Run the Data Corruption Code:**

The cell below loads the test dataset and introduces several common data quality issues.

In [None]:
# --- RUN THIS CODE FIRST ---
# This code will create a corrupted DataFrame for you to fix.

# Load the test dataset
test_path = '/content/sample_data/california_housing_test.csv'
df_challenge = pd.read_csv(test_path)

# Introduce NaN values
df_challenge.loc[[5, 20, 50], 'total_rooms'] = np.nan
df_challenge.loc[[10, 30, 60], 'population'] = np.nan

# Introduce extreme outliers
df_challenge.loc[[100, 200], 'housing_median_age'] = 999

# Introduce incorrect data type
df_challenge['households'] = df_challenge['households'].astype(str)

print("--- Corrupted Dataset is Ready ---")
df_challenge.info()

**Clean the Data**

Now, write the code to fix the `df_challenge` DataFrame.

 * **Inspect the Damage:** After running the code above, use `.info()` and `.describe()` to identify all the problems that were introduced.

 * **Fix Data Types:** The `households` column was incorrectly converted to an object (string) dtype. Convert it back to a numeric type. Hint: Use `pd.to_numeric()`.

 * **Handle Missing Values:** Fill the `NaN` values in `total_rooms` and `population` with their respective means.

 * **Handle Outliers:** The `housing_median_age` column now has unrealistic values. Filter the DataFrame to remove any rows where `housing_median_age` is greater than `90`.

 Print the `.info()` and `.describe()` outputs of your final, cleaned DataFrame to prove that all the issues have been resolved.

In [None]:
# Finding data types of all columns
df_challenge.info()

In [None]:
# Finding counts and stats of columns
df_challenge.describe()

In [None]:
# Fixing incorrect datatype of `object`(string) to `float`
df_challenge['households'] = df_challenge['households'].astype(float)

# Verifying the datatype change
df_challenge.info()
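
The hint in the task mentions `pd.to_numeric()`, which does the same job as `astype(float)` when every string parses cleanly. It may be the safer choice on messier data, since `errors="coerce"` turns unparseable entries into `NaN` instead of raising. A minimal sketch on a made-up Series (`raw` is hypothetical and not taken from the housing data):

```python
import pandas as pd

# Hypothetical string column with one unparseable entry
raw = pd.Series(["606.0", "277.0", "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```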

In [None]:
# Plugging missing values with mean of respective columns
total_rooms_mean = df_challenge['total_rooms'].mean()
population_mean = df_challenge['population'].mean()

df_challenge.fillna({'total_rooms': total_rooms_mean, 'population': population_mean}, inplace = True)

# Verifying that the values have been filled
df_challenge.info()
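
An equivalent way to write the fill, sketched here on a toy DataFrame, is to pass the Series of column means straight to `fillna` and then confirm with `isna().sum()` that nothing is missing. The DataFrame `toy` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [1000.0, np.nan, 3000.0],
    "population": [500.0, 700.0, np.nan],
})

# mean() skips NaN by default, so each mean uses only the observed values
means = toy[["total_rooms", "population"]].mean()

# Passing a Series to fillna fills each column with its own mean
toy_filled = toy.fillna(means)

print(toy_filled)
print(toy_filled.isna().sum())  # zero missing values in every column
```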

In [None]:
# Checking for outliers in `housing_median_age`, i.e., values greater than 90 years, which are implausible here
print(df_challenge[df_challenge['housing_median_age'] > 90])

# Filtering out the outliers (keep rows with an age of 90 or less)
df_filtered = df_challenge[df_challenge['housing_median_age'] <= 90]

df_filtered.info()

In [None]:
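# Verifying the outliers have been removed (expect an empty DataFrame)
print(df_filtered[df_filtered['housing_median_age'] > 90])
# Summary statistics of the final, cleaned DataFrame
df_filtered.describe()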

---
# Practice Challenges

## Video Game Sales Analysis

This exercise will guide you through a basic data analysis project to practice your data loading, cleaning, and exploration skills using pandas for data manipulation and numpy for numerical calculations.

 We will be using the Video Game Sales dataset from Kaggle.
 * You can download it here: https://www.kaggle.com/datasets/gregorut/videogamesales


### Step 1: Setup and Data Loading

 Make sure you have uploaded the `vgsales.csv` file to your Colab environment.

* Load the `vgsales.csv` file into a pandas DataFrame named `df`.
* Display the first 5 rows of the DataFrame using `df.head()` to ensure it's loaded correctly.


In [None]:
file_path = 'vgsales.csv'
df = pd.read_csv(file_path)

# Display the first few rows
print("First 5 rows of the dataset:")
df.head()

### Step 2: Data Cleaning and Initial Exploration

* Use `df.info()` and `df.isnull().sum()` to check for missing values in each column.
* Remove the rows with missing values using `df.dropna(inplace = True)`.
* The `Year` column is a float but should be an integer. Convert it to `int`.
* Print the info again to confirm your changes.

In [None]:
# Check for missing values
print("Missing values before cleaning:")
print(df.isnull().sum())

# Remove rows with any missing values
df.dropna(inplace=True)

# Convert 'Year' from float to integer
df['Year'] = df['Year'].astype(int)

# Check the info after cleaning
print("\nDataset info after cleaning:")
df.info()
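
The order of the two cleaning steps matters: `astype(int)` raises an error if the column still contains `NaN`, so the missing rows have to be dropped first. A small sketch with made-up years (the Series `years` is illustrative and not taken from `vgsales.csv`):

```python
import numpy as np
import pandas as pd

years = pd.Series([2006.0, np.nan, 2010.0])

# Casting directly fails because NaN has no integer representation
try:
    years.astype(int)
except (ValueError, TypeError) as exc:
    print(f"Direct cast failed: {exc}")

# Dropping the missing rows first makes the cast safe
clean_years = years.dropna().astype(int)
print(clean_years.tolist())  # [2006, 2010]
```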


### Step 3: Answering Analytical Questions

**Question 1: Which publisher has released the most games?**

***Hint:*** Use the `.value_counts()` method on the `Publisher` column.

**Question 2: Which genre sold the most copies in North America?**

***Hint:*** Use `.groupby()` on the `Genre` column, select the `NA_Sales` column, and calculate the `.sum()`.

**Question 3: What is the standard deviation of global sales for the "Action" genre?**

***Hint:*** First, filter the DataFrame to get only "Action" games. Then, select the `Global_Sales` column and use the `.std()` method (which uses NumPy underneath).

In [None]:
# Question 1: Which publisher has released the most games?
most_prolific_publisher = df['Publisher'].value_counts().idxmax()
print(f"The publisher with the most released games is: {most_prolific_publisher}")

In [None]:
# Question 2: Which genre has the highest total sales in North America?
genre_sales_na = df.groupby('Genre')['NA_Sales'].sum().sort_values(ascending=False)
print("\nTotal sales in North America by genre:")
print(genre_sales_na)
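
To name the top genre directly rather than reading it off the sorted table, `.idxmax()` can be applied to the grouped totals. A minimal sketch on a made-up DataFrame (`toy_sales` and its numbers are hypothetical):

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "Genre": ["Action", "Sports", "Action", "Puzzle"],
    "NA_Sales": [1.5, 2.0, 1.0, 0.3],
})

# Sum the sales per genre, then pick the label with the largest total
totals = toy_sales.groupby("Genre")["NA_Sales"].sum()
top_genre = totals.idxmax()

print(f"{top_genre}: {totals.max():.2f} million")  # Action: 2.50 million
```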

In [None]:
# Question 3: What is the standard deviation of global sales for Action games?
action_games = df[df['Genre'] == 'Action']
action_sales_std = action_games['Global_Sales'].std()
print(f"\nThe standard deviation of global sales for Action games is: {action_sales_std:.2f} million")
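
One detail behind the hint: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `np.std()` defaults to the population version (`ddof=0`), so the two give different numbers unless `ddof` is set explicitly. A minimal sketch on made-up values (the Series `sales` is illustrative only):

```python
import numpy as np
import pandas as pd

sales = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = sales.std()                          # pandas default: ddof=1
population_std = np.std(sales.to_numpy())         # NumPy default: ddof=0
matched_std = np.std(sales.to_numpy(), ddof=1)    # NumPy forced to match pandas

print(f"pandas .std():        {sample_std:.4f}")      # 1.2910
print(f"np.std() (ddof=0):    {population_std:.4f}")  # 1.1180
print(f"np.std(..., ddof=1):  {matched_std:.4f}")     # 1.2910
```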

## Bonus Challenge

Calculate the total global sales for each platform and determine which platform has the highest market share (percentage of total global sales).

In [None]:
# Calculate total sales per platform
platform_sales = df.groupby('Platform')['Global_Sales'].sum().sort_values(ascending=False)
print("\nTotal global sales by platform:")
print(platform_sales)

# Calculate total global sales for all games
total_global_sales = df['Global_Sales'].sum()

# Calculate market share
market_share = (platform_sales / total_global_sales) * 100
print("\nMarket share by platform (%):")
print(market_share)
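
To answer the "which platform" part explicitly, `.idxmax()` on the market-share Series returns the platform label. A sketch on made-up totals (the platform figures below are hypothetical, not the real dataset values):

```python
import pandas as pd

# Hypothetical total global sales per platform, in millions
toy_platforms = pd.Series({"PS2": 50.0, "X360": 30.0, "Wii": 20.0})

# Share of the overall total, as a percentage
share = toy_platforms / toy_platforms.sum() * 100
top_platform = share.idxmax()

print(f"{top_platform}: {share.max():.1f}%")  # PS2: 50.0%
```

As a sanity check, the shares should always sum to 100.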