## TASK OVERVIEW
The tasks involved pre-processing two datasets, creating machine learning models for classification and regression, and saving the results and models. Here's a breakdown of each task and the observations from the datasets provided:

Pre-process the dataset "Bike_Sales.xlsx":

Objective: Exclude missing data and outliers.
Observation:
The dataset contains both categorical columns (e.g., Month, Country, State) and numerical columns (e.g., Customer_Age, Order_Quantity, Unit_Cost).
The financial columns (Unit_Cost, Unit_Price, Profit, Cost, and Revenue) are central to the analysis; Revenue is the target of the regression task.
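
A minimal sketch of this cleaning step, assuming outliers are defined as numeric values more than three standard deviations from the column mean (a z-score rule). The threshold, the helper name `drop_missing_and_outliers`, and the toy data are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd


def drop_missing_and_outliers(df, z_threshold=3.0):
    """Drop rows with missing values, then rows with any numeric |z-score| above the threshold."""
    cleaned = df.dropna()
    numeric = cleaned.select_dtypes(include="number")
    # Constant columns have zero spread; treat their z-scores as 0 instead of dividing by zero.
    std = numeric.std(ddof=0).replace(0, np.nan)
    z_scores = ((numeric - numeric.mean()) / std).abs().fillna(0.0)
    return cleaned[(z_scores <= z_threshold).all(axis=1)]


# Toy stand-in for the real data: one row with a missing value, one extreme Revenue.
toy = pd.DataFrame({
    "Order_Quantity": [2.0] * 20 + [np.nan, 2.0],
    "Revenue": [100.0 + i for i in range(20)] + [110.0, 100000.0],
})
cleaned_toy = drop_missing_and_outliers(toy)
print(len(toy), "->", len(cleaned_toy))
```

An interquartile-range rule would be an equally reasonable choice; which one fits depends on how skewed the financial columns are.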
Save the transformed dataset:

Objective: Save the cleaned dataset as "Sentongo.xlsx".
Observation:
After cleaning, the dataset should be more reliable for training machine learning models, as it excludes rows with missing values and outliers.
Generate a machine learning model to classify "Age_Group":

Objective: Build a classifier to predict the age group of customers.
Observation:
"Age_Group" is a categorical variable with distinct categories such as Youth (<25), Young Adults (25-34), and Adults (35-64).
The Random Forest classifier was chosen for its robustness and ability to handle complex interactions between features.
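
A sketch of what such a classifier could look like, assuming scikit-learn and a pipeline that one-hot encodes the categorical inputs. The rows are copied from the `bike_sales.head()` preview; the choice of feature columns and the hyperparameters are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Five rows copied from the bike_sales.head() preview.
toy = pd.DataFrame({
    "Customer_Gender": ["M", "M", "F", "M", "M"],
    "Country": ["Canada", "Australia", "France", "United States", "United States"],
    "Unit_Price": [2443, 2295, 3578, 1120, 540],
    "Age_Group": ["Youth (<25)", "Youth (<25)", "Young Adults (25-34)",
                  "Adults (35-64)", "Adults (35-64)"],
})
categorical = ["Customer_Gender", "Country"]  # assumed feature choice
X = toy.drop(columns="Age_Group")
y = toy["Age_Group"]

# One-hot encode the categorical columns and pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
age_classifier = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
age_classifier.fit(X, y)
predictions = age_classifier.predict(X)
print(list(predictions))
```

Keeping the encoder inside the pipeline means the same transformation is applied at training and prediction time.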
Save the classifier model:

Objective: Save the classifier model as "age_predictor.pkl".
Observation:
The model can now be reused for predicting the age group of new customers based on similar data.
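
Saving and reloading could be sketched as follows, assuming the standard-library `pickle` module (the `.pkl` extension suggests it, though `joblib` would also work). The tiny model uses only the ages and age groups from the `bike_sales.head()` preview, and the file is written to a temporary directory so the sketch leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier

# Ages and age groups copied from the bike_sales.head() preview.
ages = [[17], [23], [33], [39], [42]]
groups = ["Youth (<25)", "Youth (<25)", "Young Adults (25-34)",
          "Adults (35-64)", "Adults (35-64)"]
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(ages, groups)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "age_predictor.pkl"
    # Serialize the fitted model to disk.
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    # Load it back, as a later session would.
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

same_predictions = list(restored.predict(ages)) == list(model.predict(ages))
print(same_predictions)
```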
Generate a machine learning model to predict "Revenue":

Objective: Use the cleaned dataset for training and "Bike_sales_Uganda.xlsx" for testing.
Observation:
The target variable "Revenue" is continuous, making this a regression task.
The dataset "Bike_sales_Uganda.xlsx" must contain the same feature columns as the training data so that the fitted model can be applied to it consistently.
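
A minimal sketch of the train-on-one-file, test-on-another pattern, assuming scikit-learn's `LinearRegression` with Profit and Cost as features (an assumed feature choice). The training rows are copied from the `bike_sales.head()` preview; the test rows are invented stand-ins for "Bike_sales_Uganda.xlsx".

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Training rows copied from the bike_sales.head() preview.
train = pd.DataFrame({
    "Profit": [1848, 2086, 2814, 814, 392],
    "Cost": [3038, 2504, 4342, 1426, 688],
    "Revenue": [4886, 4590, 7156, 2240, 1080],
})
# Hypothetical stand-in for Bike_sales_Uganda.xlsx (values invented for illustration).
test = pd.DataFrame({
    "Profit": [500, 1200],
    "Cost": [900, 2100],
    "Revenue": [1400, 3300],
})

features = ["Profit", "Cost"]  # the same columns must exist in both frames
model = LinearRegression().fit(train[features], train["Revenue"])
predicted = model.predict(test[features])
mse = mean_squared_error(test["Revenue"], predicted)
print(f"Test MSE: {mse:.6f}")
```

In the preview rows Revenue equals Profit plus Cost, so a linear model fits these toy rows almost exactly; the real test file may behave differently.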
## Dataset Observations
Bike_Sales.xlsx:

Structure: The dataset includes the columns Date, Day, Month, Year, Customer_Age, Age_Group, Customer_Gender, Country, State, Product_Category, Sub_Category, Product, Order_Quantity, Unit_Cost, Unit_Price, Profit, Cost, and Revenue, plus an unnamed index column ("Unnamed: 0") that appears when the file is loaded.
Data Types: A mix of categorical and numerical data. Financial figures are presented with currency symbols, which need to be stripped before numerical analysis.
Issues: Potential for missing data and outliers which need to be addressed before model training.
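
Stripping currency symbols could be sketched like this, assuming the figures arrive as strings such as "$1,519.00". That string format and the helper name `strip_currency` are assumptions; the notebook does not show the raw cell values.

```python
import pandas as pd


def strip_currency(series):
    """Remove everything except digits, the decimal point, and a minus sign, then parse as numbers."""
    cleaned = series.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    # Values that still cannot be parsed become NaN instead of raising an error.
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical raw values in the assumed format.
raw_unit_cost = pd.Series(["$1,519.00", "$1,252.00", "$344"])
unit_cost = strip_currency(raw_unit_cost)
print(unit_cost.tolist())
```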
Bike_sales_Uganda.xlsx:

Structure: Similar to the first dataset but specific to Uganda, with columns like Date, Day, Month, Year, Customer Age, Age Group, Customer Gender, Country, State, Product Category, Sub Category, Product, Order Quantity, Unit Cost, Unit Price, Profit, Cost, and Revenue.
Data Types: Also a mix of categorical and numerical data.
Issues: Data cleaning steps (like handling missing values and ensuring feature consistency with the training dataset) are crucial for accurate revenue predictions.
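
One way to keep the encoded test features consistent with the training features is to reindex the test frame to the training columns after one-hot encoding. This is a sketch using `pandas.get_dummies`; the two tiny frames are illustrative, and the Uganda row is invented.

```python
import pandas as pd

# Illustrative frames: the test data contains a country never seen in training.
train = pd.DataFrame({"Country": ["Canada", "Australia", "France"],
                      "Order_Quantity": [2, 2, 2]})
test = pd.DataFrame({"Country": ["Uganda"], "Order_Quantity": [3]})

train_encoded = pd.get_dummies(train, columns=["Country"], dtype=int)
# Align to the training columns: unseen categories are dropped, missing ones filled with 0.
test_encoded = pd.get_dummies(test, columns=["Country"], dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)
print(list(test_encoded.columns))
```

After alignment the test frame has exactly the columns the model was trained on, in the same order.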
## Key Takeaways
Pre-processing: Vital to remove missing values and outliers to ensure model accuracy and reliability.
Feature Engineering: One-hot encoding and standardization are necessary for handling categorical variables and scaling numerical features.
Model Selection: Random Forest for classification tasks due to its ability to handle various types of data and interactions; Linear Regression for predicting continuous variables like Revenue.
Model Evaluation: It's essential to evaluate models using metrics like classification reports for classifiers and mean squared error for regressors.
File Handling: Saving cleaned datasets and models allows for reproducibility and further analysis.
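
The two evaluation metrics named above can be computed with scikit-learn as in this sketch. The labels and predictions are made up to show the calls and are not results from the notebook.

```python
from sklearn.metrics import classification_report, mean_squared_error

# Made-up classifier output, for illustration only.
age_true = ["Youth (<25)", "Youth (<25)", "Young Adults (25-34)", "Adults (35-64)"]
age_pred = ["Youth (<25)", "Young Adults (25-34)", "Young Adults (25-34)", "Adults (35-64)"]
report = classification_report(age_true, age_pred, zero_division=0)
print(report)

# Made-up regressor output, for illustration only.
revenue_true = [4886, 4590]
revenue_pred = [4880, 4600]
mse = mean_squared_error(revenue_true, revenue_pred)
print(mse)
```

The classification report lists precision, recall, and F1 per class, while the mean squared error averages the squared differences between true and predicted revenue.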

In [4]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Load the datasets
bike_sales = pd.read_excel('Bike_Sales.xlsx')
bike_sales_uganda = pd.read_excel('Bike_sales_Uganda.xlsx')

In [6]:
bike_sales.head()

Unnamed: 0,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
0,2017-01-01,1,January,2017,17,Youth (<25),M,Canada,British Columbia,Bikes,Road Bikes,"Road-250 Red, 44",2,1519,2443,1848,3038,4886
1,2017-01-01,1,January,2017,23,Youth (<25),M,Australia,Victoria,Bikes,Mountain Bikes,"Mountain-200 Black, 46",2,1252,2295,2086,2504,4590
2,2017-01-01,1,January,2017,33,Young Adults (25-34),F,France,Yveline,Bikes,Road Bikes,"Road-150 Red, 48",2,2171,3578,2814,4342,7156
3,2017-01-01,1,January,2017,39,Adults (35-64),M,United States,Washington,Bikes,Road Bikes,"Road-550-W Yellow, 38",2,713,1120,814,1426,2240
4,2017-01-01,1,January,2017,42,Adults (35-64),M,United States,California,Bikes,Road Bikes,"Road-750 Black, 44",2,344,540,392,688,1080
