# Data Scientist Associate Practical Exam Submission



**Enhancing Clarity with Included Code**

The notebook includes the code used for data preprocessing and modeling. You do not need to run it to follow the analysis, but it shows exactly how the data was cleaned and transformed.

The written summary under each task stands on its own; refer to the code when you want the details behind a result.

In [29]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

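# Load the raw enrollment data
data = pd.read_csv('university_enrollment_2306.csv')

# Fill missing values: 0 for post_score, 'None' for pre_requirement
data['post_score'] = data['post_score'].fillna(0)
data['pre_requirement'] = data['pre_requirement'].fillna('None')

# Merge the unofficial 'Math' label into 'Mathematics'
data['department'] = data['department'].replace('Math', 'Mathematics')

# pre_score uses '-' as a placeholder; set it to 0 and cast to float
data['pre_score'] = data['pre_score'].replace('-', 0)
data['pre_score'] = data['pre_score'].astype(float)

# Drop course_id and year, then one-hot encode the categorical columns
data_rg = pd.get_dummies(data.drop(columns=['course_id', 'year']))

# Separate the features from the target (a 1-D Series, as scikit-learn expects)
X = data_rg.drop(columns=['enrollment_count'])
y = data_rg['enrollment_count']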

## Task 1 

The dataset contains **1850 rows and 8 columns** with missing values before cleaning. I have validated all the columns against the criteria in the dataset table:

- course_id: Matches the description, no missing values, 1850 unique course identifiers.
- course_type: Matches the description, no missing values, 2 categories: "classroom" and "online".
- year: Matches the description, no missing values, years from 2011 to 2022 inclusive.
- enrollment_count: Matches the description, no missing values.
- pre_score: 130 missing values (recorded as `-`), so I replaced them with 0 and converted the column to float64 continuous data.
- post_score: More than 180 missing values, so I replaced them with 0.
- pre_requirement: More than 890 missing values, so I replaced them with "None".
- department: An unlisted category `Math` was found, so I replaced every `Math` value with `Mathematics`, the only official name for that category in the table.

After the data validation, the dataset contains **1850 rows and 8 columns**.
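
Counts like the ones listed above can be reproduced with a small helper. The sketch below is a minimal standard-library illustration that runs on a hypothetical three-row sample rather than the real file. `SAMPLE_CSV` and `count_missing` are names introduced here, the sample values are invented, and treating both empty cells and the `-` placeholder as missing is an assumption about how gaps are encoded.

```python
import csv
import io

# Hypothetical sample; the real data lives in university_enrollment_2306.csv.
SAMPLE_CSV = """course_id,pre_score,post_score,pre_requirement,department
1,28.1,73.0,Beginner,Science
2,-,,,Math
3,66.3,88.0,,Mathematics
"""

def count_missing(csv_text, placeholders=("", "-")):
    """Return {column: number of cells that are empty or a placeholder}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() in placeholders:
                counts[name] += 1
    return counts

missing = count_missing(SAMPLE_CSV)
print(missing)
```

In pandas, `data.isna().sum()` gives the same kind of per-column count, although it would not catch the `-` placeholder unless that value is first converted to a missing value.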

## Task 2

From **The Count of Enrollments**, the majority of enrollment counts cluster around 260. Fewer courses have enrollments between 150 and 190 than between 240 and 265, and no course has an enrollment count between 190 and 229.

Within each of the two groups the distribution is roughly normal. The two groups may correspond to two distinct categories of course, which further exploration will clarify.

![image](image.png)


## Task 3

From **The count of Course Types**, we can see that the `online` course type has the most observations, nearly 1380, followed by `classroom` with around 500.

The observations are imbalanced across course types: `online` accounts for 74% of them and `classroom` for 26%.

![image-4](image-4.png)
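
The percentage split reported above is simply each category's count divided by the total. As a minimal sketch, assuming a plain list of labels (the `labels` list and the `class_shares` helper are hypothetical and do not use the real column):

```python
from collections import Counter

def class_shares(labels):
    """Return {label: percentage of observations}, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Hypothetical labels with a 3:1 split, not the real column.
labels = ["online"] * 3 + ["classroom"]
shares = class_shares(labels)
print(shares)
```

With the DataFrame from the notebook, `data['course_type'].value_counts(normalize=True)` would return the same shares as fractions.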


## Task 4

From **The Relationship between Course type and Enrollment Count**, we can see that `classroom` courses have lower enrollments, with an interquartile range (IQR) of roughly 165 to 180, than `online` courses, whose IQR is roughly 240 to 262. This explains the distribution of enrollment counts, which showed two separate segments.

![image-5](image-5.png)
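
The IQR read from a box plot is the span between the first and third quartiles. The sketch below shows one way to compute those bounds with the standard library. The `classroom` values are hypothetical and `iqr_bounds` is a helper introduced here. Plotting libraries may interpolate quartiles slightly differently, so treat the numbers as illustrative.

```python
import statistics

def iqr_bounds(values):
    """Return (Q1, Q3) using the inclusive quartile method."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

# Hypothetical enrollment counts for one course type.
classroom = [165, 170, 175, 180, 185]
q1, q3 = iqr_bounds(classroom)
print(q1, q3, q3 - q1)
```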


**Make changes to enable modeling**

Finally, to enable model fitting, I have made the following changes:

- Remove the `course_id` column: every value is unique, so it carries no predictive signal. The preprocessing code also drops `year`.
- Convert all categorical variables into numeric dummy (one-hot) variables.
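
The notebook performs the dummy conversion with `pd.get_dummies`. To show what that transformation does, here is a hand-rolled sketch on hypothetical rows. `one_hot` is an illustrative helper, not the pandas implementation, and it follows the same `column_category` naming that `get_dummies` uses by default.

```python
def one_hot(rows, column):
    """Replace `column` in each row dict with 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

# Hypothetical rows, not taken from the dataset.
rows = [
    {"course_type": "online", "enrollment_count": 261},
    {"course_type": "classroom", "enrollment_count": 170},
]
encoded = one_hot(rows, "course_type")
print(encoded[0])
```

Each category becomes its own 0/1 column, which is what lets a linear model use a text-valued feature.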

## Task 5

Predicting the count of enrollments a course will get is a **regression problem** in machine learning.

In [8]:
# Import ML models and performance metrics
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
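
The call above holds out 25% of the rows for testing and fixes the shuffle with a seed so the split is repeatable. The following sketch illustrates that idea in plain Python. It is not scikit-learn's implementation, and `simple_split` is a hypothetical helper that works on any sequence.

```python
import math
import random

def simple_split(items, test_size=0.25, seed=123):
    """Shuffle a copy of `items` and cut off a test fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(8))
print(len(train), len(test))
```

Because the seed is fixed, calling the function again with the same input returns the same split, which is the role `random_state=123` plays in the notebook.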

## Task 6

Baseline Model - Linear Regression Model

In [25]:
# Fit the baseline model and predict on the held-out test set
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

## Task 7

Comparison Model - Random Forest Model

In [26]:
# Create and fit a Random Forest regressor with default settings
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

## Task 8

I am selecting the **Linear Regression model** as the baseline because it is simple, quick to train, and easy to interpret. As a comparison model, I am using the **Random Forest model**, which can capture more complex, non-linear relationships between the input features and the target variable.

## Task 9

I'm opting for the **Root Mean Squared Error (RMSE)** as my evaluation metric. It's a widely used measure that offers ease of interpretation by utilizing the same unit as the target variable. RMSE calculates the square root of the average squared difference between predicted and actual values, providing insight into the overall accuracy of the model's predictions.
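
To make the definition concrete, the sketch below computes RMSE by hand on hypothetical values. `rmse` is a helper written for illustration; the notebook itself takes the square root of scikit-learn's `mean_squared_error`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values: every prediction is off by exactly 2.
score = rmse([10, 10, 10, 10], [12, 8, 12, 8])
print(score)
```

Every prediction here misses by 2, so the RMSE is 2.0, expressed in the same unit as the target.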

**Evaluating Linear Regression Model**

In [27]:
mse_l = mean_squared_error(y_test, y_pred_lr)
print(np.sqrt(mse_l))

0.30285939882636104


**Evaluating Random Forest Model**

In [28]:
mse_r = mean_squared_error(y_test, y_pred_rf)
print(np.sqrt(mse_r))

0.37171745464897493


## Task 10

A smaller RMSE value indicates the model has smaller errors in prediction.

Therefore, based on this metric, the **Linear Regression model** (RMSE ≈ 0.30) performs better than the Random Forest model (RMSE ≈ 0.37) at predicting the enrollment count of a course.
