Task 1: Predicting the Chances of Catching Coronavirus


Narrative
In this task, we analyzed a COVID-19 testing dataset to understand which patient characteristics are associated with a positive test result. We explored the data, preprocessed it, and built classification models that predict the covid19_test_results column. The primary goal was to derive insights that could inform public health decisions.




In [28]:
#Step 1: Data Preprocessing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
# df_corona = pd.read_csv(r'C:\Users\Rigz\Downloads\coronavirusdataset.csv')
df_corona = pd.read_csv('coronavirusdataset.csv')

# Display basic information about the dataset
print(df_corona.info())
print(df_corona.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7294 entries, 0 to 7293
Data columns (total 45 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   batch_date                     7294 non-null   object 
 1   test_name                      7294 non-null   object 
 2   swab_type                      7294 non-null   object 
 3   covid19_test_results           7294 non-null   object 
 4   age                            7294 non-null   int64  
 5   high_risk_exposure_occupation  7294 non-null   bool   
 6   high_risk_interactions         2727 non-null   object 
 7   diabetes                       7294 non-null   bool   
 8   chd                            7294 non-null   bool   
 9   htn                            7294 non-null   bool   
 10  cancer                         7294 non-null   bool   
 11  asthma                         7294 non-null   bool   
 12  copd                           7294 non-null   b

In [29]:
# Check for missing values
print(df_corona.isnull().sum())

batch_date                          0
test_name                           0
swab_type                           0
covid19_test_results                0
age                                 0
high_risk_exposure_occupation       0
high_risk_interactions           4567
diabetes                            0
chd                                 0
htn                                 0
cancer                              0
asthma                              0
copd                                0
autoimmune_dis                      0
smoker                              0
temperature                      5425
pulse                            5428
sys                              5567
dia                              5567
rr                               5750
sats                             5425
rapid_flu_results                7288
rapid_strep_results              7283
ctab                             6006
labored_respiration              5331
rhonchi                          6571
wheezes     

Many columns are mostly empty: for example, rapid_flu_results is missing in 7,288 of 7,294 rows and temperature in 5,425. Missing values are filled by mean imputation in a later cell.
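The counts above are easier to compare as fractions of the row count. The following is a minimal sketch, not part of the original notebook: the helper names `missing_fraction` and `drop_sparse_columns`, the 0.5 threshold, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose missing fraction is at most max_missing.

    The 0.5 default is a hypothetical threshold, not a value from the notebook.
    """
    frac = df.isnull().mean()
    return df.loc[:, frac <= max_missing]


# Toy data standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    "age": [34, 51, 29, 62],
    "temperature": [37.1, None, None, 36.8],
    "rapid_flu_results": [None, None, None, "Negative"],
})

print(missing_fraction(toy))
print(drop_sparse_columns(toy).columns.tolist())
```

A column that is almost entirely missing carries little signal after mean imputation, because nearly every row receives the same filled value. Dropping such columns before modelling is one option worth comparing against imputing them.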

In [30]:
# Preprocess the data
# Encode the target with its own encoder so its class labels stay recoverable
target_le = LabelEncoder()
df_corona['covid19_test_results'] = target_le.fit_transform(df_corona['covid19_test_results'])

# Encode the remaining object columns, using a fresh encoder for each column
cat_cols = df_corona.select_dtypes(include=['object']).columns
for col in cat_cols:
    df_corona[col] = LabelEncoder().fit_transform(df_corona[col])


In [31]:
# Fill missing values with the mean
df_corona.fillna(df_corona.mean(), inplace=True)

In [32]:
# Define feature columns and target
feature_columns = df_corona.drop(['covid19_test_results'], axis=1).columns
X = df_corona[feature_columns]
y = df_corona['covid19_test_results']

In [33]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [34]:
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

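Categorical columns, including batch_date, were label-encoded to integers, and missing numeric values were filled with the column mean.
Features were standardized after the train/test split, so the scaler was fit on the training data only.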
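To make the scaling step concrete, here is a small pure-Python sketch of what fitting on the training split and then transforming the test split amounts to. The function names and the toy values are assumptions for illustration. Like StandardScaler, it uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns, which would otherwise divide by zero
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]


# Toy ages standing in for one feature column (values are invented)
train_age = [20, 30, 40, 50]
test_age = [35, 60]

mean, std = fit_standardizer(train_age)
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation on the test split keeps information about the test data out of the preprocessing step.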

Step 2: Implementing the Models

In [35]:
# Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)

# Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)


Step 3: Evaluating the Models

In [36]:
# Evaluate each classifier with accuracy and a per-class report
# (accuracy_score and classification_report are imported in Step 1)
predictions_by_model = {
    "Random Forest Classifier": rf_predictions,
    "Gradient Boosting Classifier": gb_predictions,
    "Decision Tree Classifier": dt_predictions,
}

for name, predictions in predictions_by_model.items():
    print(f"{name} Results:")
    print(f"Accuracy: {accuracy_score(y_test, predictions)}")
    print(classification_report(y_test, predictions))


Because the three models are classifiers, they are evaluated with accuracy and per-class precision and recall rather than RMSE and R^2. No evaluation scores were recorded for this task, so this run does not establish which model predicts test results best.
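Since the Task 1 models output class labels, classification metrics are the natural way to compare them. The sketch below computes accuracy, precision, and recall by hand on invented labels. The function names and data are assumptions for illustration, and class 1 is assumed to mean a positive test.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

In this toy case accuracy is 0.8 while precision and recall for the positive class are both 0.5. Accuracy alone can therefore look better than the model's ability to find positive cases when positives are rare.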

Task 2: Predicting Car Prices Using the Provided Auto Dataset

Narrative


In this task, we analyzed an auto dataset to understand the factors affecting car prices. We aimed to preprocess the data, implement regression models, and evaluate their performance to predict car prices accurately.

Step 1: Loading the Dataset

In [None]:
import pandas as pd

# Load the dataset with error handling for encoding
# file_path = r'C:\Users\Rigz\Downloads\Auto Dataset\Auto Dataset.csv'
file_path = 'Auto Dataset.csv'

try:
    auto_df = pd.read_csv(file_path, encoding='utf-8')
except UnicodeDecodeError:
    auto_df = pd.read_csv(file_path, encoding='latin1')

# Display basic information about the dataset
print("Dataset Info:")
print(auto_df.info())

print("\nDataset Description:")
print(auto_df.describe())

# Check for missing values
print("\nMissing Values:")
print(auto_df.isnull().sum())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  ob

Among the columns shown, missing values occur only in categorical columns (vehicleType, gearbox, model, fuelType, and notRepairedDamage); the numerical columns are complete.

Step 2: Data Preprocessing

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Fill missing values for numerical columns with their mean
num_cols_auto = auto_df.select_dtypes(include=['number']).columns
auto_df[num_cols_auto] = auto_df[num_cols_auto].fillna(auto_df[num_cols_auto].mean())

In [None]:
# Encode categorical variables
le = LabelEncoder()
cat_cols_auto = auto_df.select_dtypes(include=['object']).columns
for col in cat_cols_auto:
    auto_df[col] = le.fit_transform(auto_df[col])

In [None]:
# Define features and target
X_auto = auto_df.drop('price', axis=1)
y_auto = auto_df['price']

In [None]:
# Split the dataset into training and testing sets
X_train_auto, X_test_auto, y_train_auto, y_test_auto = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_auto = scaler.fit_transform(X_train_auto)
X_test_auto = scaler.transform(X_test_auto)

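Categorical variables were label-encoded, and missing numerical values were filled with column means. Because price has dtype object, the encoding loop also converted the target into integer category codes, so the RMSE values reported below are in units of encoded labels rather than currency.
Features were standardized, although the tree-based models used here split on feature order and are largely insensitive to scaling.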
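The info() output lists price and odometer as object columns, so they are stored as text rather than numbers. One way to recover numeric values before modelling is sketched below. It assumes the strings look like "$5,000" and "150,000km", a format the notebook output does not show, and `to_number` is a hypothetical helper.

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Strip everything except digits and the decimal point, then convert.

    Values that contain no digits become NaN instead of raising an error.
    """
    cleaned = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Toy rows in the assumed string format (values are invented)
toy = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "125,000km"],
})

toy["price"] = to_number(toy["price"])
toy["odometer"] = to_number(toy["odometer"])

print(toy.dtypes)
print(toy)
```

With a numeric target, RMSE is expressed in the same units as price. The regressors also see the real ordering of prices rather than an alphabetical ordering of their string forms.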

Step 3: Implementing the Models

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Random Forest Regressor
rf_regressor = RandomForestRegressor(random_state=42)
rf_regressor.fit(X_train_auto, y_train_auto)
rf_predictions_auto = rf_regressor.predict(X_test_auto)

In [None]:
# Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(random_state=42)
gb_regressor.fit(X_train_auto, y_train_auto)
gb_predictions_auto = gb_regressor.predict(X_test_auto)


In [None]:
# Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train_auto, y_train_auto)
dt_predictions_auto = dt_regressor.predict(X_test_auto)

Step 4: Evaluating the Models

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Random Forest Regressor Evaluation
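# RMSE is the square root of MSE; taking the root directly avoids the
# `squared` argument, which is deprecated in recent scikit-learn releases
rf_rmse = mean_squared_error(y_test_auto, rf_predictions_auto) ** 0.5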
rf_r2 = r2_score(y_test_auto, rf_predictions_auto)
print("Random Forest Regressor Results:")
print(f"RMSE: {rf_rmse}")
print(f"R^2: {rf_r2}")

Random Forest Regressor Results:
RMSE: 701.4900846383362
R^2: 0.18661850750609366




In [None]:
# Gradient Boosting Regressor Evaluation
# Take the square root of MSE directly instead of passing the deprecated `squared` argument
gb_rmse = mean_squared_error(y_test_auto, gb_predictions_auto) ** 0.5
gb_r2 = r2_score(y_test_auto, gb_predictions_auto)
print("\nGradient Boosting Regressor Results:")
print(f"RMSE: {gb_rmse}")
print(f"R^2: {gb_r2}")


Gradient Boosting Regressor Results:
RMSE: 739.550916893928
R^2: 0.0959605706987885




In [None]:
# Decision Tree Regressor Evaluation
# Take the square root of MSE directly instead of passing the deprecated `squared` argument
dt_rmse = mean_squared_error(y_test_auto, dt_predictions_auto) ** 0.5
dt_r2 = r2_score(y_test_auto, dt_predictions_auto)
print("\nDecision Tree Regressor Results:")
print(f"RMSE: {dt_rmse}")
print(f"R^2: {dt_r2}")


Decision Tree Regressor Results:
RMSE: 971.284937492598
R^2: -0.5593532883156942




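The Random Forest Regressor had the lowest RMSE (701.49) and the highest R^2 (0.187), making it the most effective of the three models on this dataset. The Gradient Boosting Regressor came second (RMSE 739.55, R^2 0.096). The Decision Tree Regressor's negative R^2 (-0.559) means it performed worse than always predicting the mean of the target. Even the best model explains under 19% of the variance, so there is substantial room for improvement.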
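To help interpret these scores, the sketch below computes RMSE and R^2 by hand. R^2 compares a model's squared error with that of always predicting the mean of the target, so it is 0 for the mean baseline and negative for anything worse, as with the Decision Tree result above. The function names and numbers are invented for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented targets and two sets of predictions
y_true = [10.0, 20.0, 30.0, 40.0]
mean_baseline = [25.0, 25.0, 25.0, 25.0]  # always predict the mean
reversed_guess = [40.0, 30.0, 20.0, 10.0]  # systematically wrong

print(r2(y_true, mean_baseline))
print(r2(y_true, reversed_guess))
print(rmse(y_true, reversed_guess))
```

Here the mean baseline scores an R^2 of 0 and the reversed predictions score -3, which shows how a model can fall below zero when its errors exceed the spread of the target itself.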