# Machine Learning

#### Q-1. Imagine you have a dataset with Instagram features such as Username, Caption, Hashtags, Followers, Time_Since_posted, and Likes. Your task is to build a model that predicts the number of likes and the time since posted, using the remaining features as inputs. You can use the following dataset for this question.
Dataset: https://www.kaggle.com/datasets/rxsraghavagrawal/instagram-reach?resource=download

In [3]:
import pandas as pd

data = pd.read_csv('instagram_reach.csv')

from sklearn.model_selection import train_test_split

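# Build the feature matrix and the two targets.
# Note: 'Time since posted' is one-hot encoded into X even though the question
# also lists it as a target, so it acts as an input for the likes model below.
X = pd.get_dummies(data[['USERNAME', 'Caption', 'Hashtags', 'Followers', 'Time since posted']])
y_likes = data['Likes']
y_time_since_posted = data['Time since posted']

# Split the dataset into training and testing sets (80/20)
X_train, X_test, y_likes_train, y_likes_test, y_time_train, y_time_test = train_test_split(X, y_likes, y_time_since_posted, test_size=0.2, random_state=42)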

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Initialize and train the model
model_likes = LinearRegression()
model_likes.fit(X_train, y_likes_train)

# Make predictions on the testing data
likes_predictions = model_likes.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_likes_test, likes_predictions)
mae = mean_absolute_error(y_likes_test, likes_predictions)
r2 = r2_score(y_likes_test, likes_predictions)
print('Mean Squared Error (MSE):', mse)
print('Mean Absolute Error (MAE):', mae)
print('R-squared Score:', r2)

import joblib

# Save the trained model
joblib.dump(model_likes, 'likes_prediction_model.pkl')

Mean Squared Error (MSE): 1038.0575042696987
Mean Absolute Error (MAE): 22.718871322012863
R-squared Score: 0.2677501640806921


['likes_prediction_model.pkl']
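
The cell above fits a model for likes only, while the question also asks for the time since posted. One possible way to handle that second target is sketched below. It assumes the `Time since posted` column holds strings such as `'11 hours'`; the toy rows and the helper `parse_hours` are hypothetical stand-ins for illustration, not part of the original solution or the real file.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def parse_hours(series: pd.Series) -> pd.Series:
    """Extract the leading number from strings such as '11 hours' (assumed format)."""
    return series.str.extract(r'(\d+)', expand=False).astype(float)


# Hypothetical toy rows standing in for instagram_reach.csv
toy = pd.DataFrame({
    'Followers': [100, 250, 400, 800, 1200, 1500],
    'Time since posted': ['2 hours', '3 hours', '5 hours',
                          '8 hours', '11 hours', '24 hours'],
})

# Numeric regression target: hours since the post was published
hours = parse_hours(toy['Time since posted'])

# The target column is deliberately left out of the inputs to avoid leakage
X_time = toy[['Followers']]

model_time = LinearRegression()
model_time.fit(X_time, hours)
time_predictions = model_time.predict(X_time)
```

With the real data, the same train/test split used for the likes model could be reused, as long as `Time since posted` is dropped from the feature matrix before fitting this second model.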

#### Q-4. Imagine you are working as a sales manager. You need to predict the Revenue, whether that revenue occurs on a weekend or not, and the Informational_Duration, using an ensemble learning algorithm.
You can use the following dataset for this question.

Dataset: https://www.kaggle.com/datasets/henrysue/online-shoppers-intention
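
The weekend part of the question is a second classification target, which can be handled with the same kind of ensemble model as Revenue. A minimal sketch follows, assuming a boolean `Weekend` column; the rows are randomly generated placeholders rather than the real dataset, so the resulting accuracy carries no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic rows standing in for online_shoppers_intention.csv
rng = np.random.default_rng(42)
n_rows = 200
toy = pd.DataFrame({
    'ProductRelated': rng.integers(0, 50, size=n_rows),
    'BounceRates': rng.random(n_rows),
    'Revenue': (rng.random(n_rows) < 0.2).astype(int),
    'Weekend': rng.random(n_rows) < 0.25,
})

# Weekend is the target here, so it is removed from the inputs
X_weekend = toy.drop(columns=['Weekend'])
y_weekend = toy['Weekend']

# Stratify so both splits keep a similar share of weekend sessions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_weekend, y_weekend, test_size=0.2, random_state=42, stratify=y_weekend
)

rf_classifier_weekend = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier_weekend.fit(X_tr, y_tr)

y_pred_weekend = rf_classifier_weekend.predict(X_te)
accuracy_weekend = accuracy_score(y_te, y_pred_weekend)
```

On the real data, `Weekend` would need to be dropped from the feature matrix before fitting this classifier, since it would otherwise be an input to its own prediction.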

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error

# Load the dataset
data = pd.read_csv('online_shoppers_intention.csv')

# Preprocessing
# Drop columns that are not used as features in this solution
data = data.drop(['Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType'], axis=1)

# Split the dataset into features (X) and target variables (y_revenue, y_informational_duration)
X = data.drop(['Revenue', 'Informational_Duration'], axis=1)
y_revenue = data['Revenue']
y_informational_duration = data['Informational_Duration']

# Convert categorical variables to numerical using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train_revenue, y_test_revenue, y_train_informational_duration, y_test_informational_duration = train_test_split(
    X, y_revenue, y_informational_duration, test_size=0.2, random_state=42
)

# Ensemble Learning - Random Forest Classifier for revenue prediction
rf_classifier_revenue = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier_revenue.fit(X_train, y_train_revenue)

# Predict revenue for the test set
y_pred_revenue = rf_classifier_revenue.predict(X_test)

# Evaluation for revenue prediction
accuracy_revenue = accuracy_score(y_test_revenue, y_pred_revenue)
classification_report_revenue = classification_report(y_test_revenue, y_pred_revenue)

print("Revenue Prediction Accuracy:", accuracy_revenue)
print("Revenue Prediction Classification Report:")
print(classification_report_revenue)

# Ensemble Learning - Random Forest Regressor for informational duration prediction
rf_regressor_informational_duration = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor_informational_duration.fit(X_train, y_train_informational_duration)

# Predict informational duration for the test set
y_pred_informational_duration = rf_regressor_informational_duration.predict(X_test)

# Evaluation for informational duration prediction
mse_informational_duration = mean_squared_error(y_test_informational_duration, y_pred_informational_duration)

print("Mean Squared Error for Informational Duration:", mse_informational_duration)

# Show actual and predicted values side by side for both targets
results_df = pd.DataFrame({
    'Actual Revenue': y_test_revenue,
    'Predicted Revenue': y_pred_revenue,
    'Actual Informational Duration': y_test_informational_duration,
    'Predicted Informational Duration': y_pred_informational_duration
})
print("Predicted Results:")
print(results_df)

Revenue Prediction Accuracy: 0.8888888888888888
Revenue Prediction Classification Report:
              precision    recall  f1-score   support

       False       0.91      0.96      0.94      2055
        True       0.74      0.51      0.61       411

    accuracy                           0.89      2466
   macro avg       0.82      0.74      0.77      2466
weighted avg       0.88      0.89      0.88      2466

Mean Squared Error for Informational Duration: 13995.616144280026
Predicted Results:
       Actual Revenue  Predicted Revenue  Actual Informational Duration  \
8916            False              False                           0.00   
772              True              False                         235.55   
12250           False              False                           0.00   
7793            False              False                           0.00   
6601            False               True                         733.80   
...               ...                ...        