# Project 3: Modelling of Data

# Dataset

## Import Libraries

In [171]:
'''Import libraries for data analysis'''
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

## Load Dataset

In [172]:
'''Load processed statistics and machine learning dataset'''
stats_ml_df = pd.read_csv('../datasets/stats_ml_processed_df.csv',index_col=False)

'''convert datetime from string type to datetime'''
stats_ml_df['datetime'] =  pd.to_datetime(stats_ml_df['datetime'])

In [173]:
'''view loaded dataframe'''
stats_ml_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1977 entries, 0 to 1976
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   title                   1977 non-null   object             
 1   sub_text                1822 non-null   object             
 2   id                      1977 non-null   object             
 3   author                  1973 non-null   object             
 4   score                   1977 non-null   int64              
 5   upvote_ratio            1977 non-null   float64            
 6   comments_list           1977 non-null   object             
 7   datetime                1977 non-null   datetime64[ns, UTC]
 8   year                    1977 non-null   int64              
 9   month                   1977 non-null   int64              
 10  joined_data             1977 non-null   object             
 11  tokenize_join_comments  1977 non-null   obj

## Data Dictionary

|Feature |Type| Description|
|---|---|---|
|title|string object|Title of the subreddit post|
|sub_text|string object|Additional description that follows the title. This field may be empty|
|id|string object|Unique subreddit id of each post|
|author|string object|Author of the post|
|score|int|The number of upvotes for the submission|
|upvote_ratio|float|The percentage of upvotes from all votes on the submission|
|comments_list|string object|Extracted comments including the replies|
|datetime|datetime|Date and time when the post was posted|
|year|int|Year extracted from datetime|
|month|int|Month extracted from datetime|
|joined_data|string object|Joined text containing information from title, sub_text and comments|
|tokenize_join_comments|string object|Tokenised text from the joined_data column with stop words removed|
|is_ml|int|1 indicates that the content came from machine learning subreddit. 0 indicates that the content came from statistics subreddit.|

# (M1) Modelling (Baseline Model) | Using tokenize_join_comments

Modelling will be done on tokenize_join_comments, as the focus is on how effectively the joined information from the title, sub_text and comments predicts whether a thread comes from the Machine Learning or the Statistics subreddit.

In [174]:
'''Instantiate X and y for prediction'''
X_extracted = stats_ml_df['tokenize_join_comments']
y_extracted = stats_ml_df['is_ml']

## Count Vectoriser Analysis

In [175]:
'''Create a CountVectorizer'''
vectorizer_cv = CountVectorizer()

'''Convert text into numerical features'''
X = vectorizer_cv.fit_transform(X_extracted)

### Logistic Regression - Count Vectoriser

In [176]:
'''Train Test Split data for modelling'''
X_train, X_test, y_train, y_test = train_test_split(X, y_extracted, test_size=0.2, random_state=42)

'''Instantiate Logistic Regression Model'''
log_reg_model_cv = LogisticRegression()

'''Train and Fit model'''
log_reg_model_cv.fit(X_train, y_train)

'''Get predictions on test set'''
predictions = log_reg_model_cv.predict(X_test)

'''Calculate accuracy score'''
log_accuracy = accuracy_score(y_test, predictions)

'''Calculate mean coefficient'''
coefficients_mean = log_reg_model_cv.coef_.mean()

print('Model Performance for Logistic Regression Count Vectoriser:')
print('accuracy: ', round(log_accuracy,3))
print('Coefficient Mean: ', coefficients_mean)


Model Performance for Logistic Regression Count Vectoriser:
accuracy:  0.972
Coefficient Mean:  0.00021273513982014475
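
One caveat with the cells above: the vectoriser is fitted on the full corpus before `train_test_split`, so the vocabulary is built with the test rows included. A minimal sketch of an alternative, assuming a `Pipeline` is acceptable here, splits the raw text first and fits the vectoriser on the training rows only. The toy corpus and the names `texts`, `labels` and `pipe` are hypothetical stand-ins for `stats_ml_df['tokenize_join_comments']` and `stats_ml_df['is_ml']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical), not the notebook's data.
texts = [
    "neural network training gpu", "deep learning model github",
    "transformer model code", "gpt openai model training",
    "hypothesis test value", "linear regression variable mean",
    "confidence interval sample mean", "anova variance hypothesis",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test rows stay unseen by the vectoriser.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only,
# then reuses that vocabulary to transform the test text.
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])
pipe.fit(text_train, y_train)

test_accuracy = pipe.score(text_test, y_test)
print("accuracy:", round(test_accuracy, 3))
```

With this layout the same `pipe` object can also be passed directly to `GridSearchCV`, so the vectoriser is refitted inside every fold.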


### Random Forest- Count Vectoriser

In [231]:
'''Train Test Split data for modelling'''
X_train, X_test, y_train, y_test = train_test_split(X, y_extracted, test_size=0.2, random_state=42)

'''Instantiate Random Forest Model'''
rf_model_cv = RandomForestClassifier()

'''Train and Fit model'''
rf_model_cv.fit(X_train, y_train)

'''Get predictions on test set'''
predictions = rf_model_cv.predict(X_test)

'''Get feature importance'''
importances_mean = rf_model_cv.feature_importances_.mean()

'''Calculate accuracy score'''
rf_accuracy = accuracy_score(y_test, predictions)
print('Model Performance for Random Forest  Count Vectoriser:')
print('accuracy: ', rf_accuracy)
print('importance', importances_mean)


Model Performance for Random Forest  Count Vectoriser:
accuracy:  0.9722222222222222
importance 2.571619606027877e-05
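
A note on the mean importance printed above: `feature_importances_` of a fitted random forest is normalised to sum to 1, so its mean is simply 1 divided by the number of features. A value of about 2.57e-05 would correspond to roughly 38,900 vectorised columns and says little about the model itself. The sketch below, on synthetic data from `make_classification` (an assumption, not the notebook's data), checks this and ranks individual features instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical), 20 features.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_toy, y_toy)

importances = rf.feature_importances_

# The importances sum to 1, so the mean is 1 / n_features.
print("sum :", importances.sum())
print("mean:", importances.mean(), "vs 1/n_features:", 1 / X_toy.shape[1])

# Ranking individual features is more informative than the mean.
top = np.argsort(importances)[::-1][:5]
for idx in top:
    print("feature", idx, "importance", round(importances[idx], 4))
```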


## TF-IDF Vectoriser Analysis

In [195]:
'''Create a TfidfVectorizer'''
vectorizer_tv = TfidfVectorizer()

'''Convert text into numerical features'''
X = vectorizer_tv.fit_transform(X_extracted)


### Logistic Regression - TF-IDF Vectoriser

In [196]:
'''Train Test Split data for modelling'''
X_train, X_test, y_train, y_test = train_test_split(X, y_extracted, test_size=0.2, random_state=42)

'''Instantiate Logistic Regression Model'''
log_reg_model_tv = LogisticRegression()

'''Train and Fit model'''
log_reg_model_tv.fit(X_train, y_train)

'''Get predictions on test set'''
predictions = log_reg_model_tv.predict(X_test)

'''Calculate accuracy score'''
log_accuracy = accuracy_score(y_test, predictions)

'''Calculate mean coefficient'''
coefficients_mean = log_reg_model_tv.coef_.mean()

print('Model Performance for Logistic Regression TF-IDF Vectoriser:')
print('accuracy: ', log_accuracy)
print('Coefficient Mean: ', coefficients_mean)

Model Performance for Logistic Regression TF-IDF Vectoriser:
accuracy:  0.9823232323232324
Coefficient Mean:  0.0021066192306905306


### Random Forest - TF-IDF Vectoriser

In [197]:
'''Train Test Split data for modelling'''
X_train, X_test, y_train, y_test = train_test_split(X, y_extracted, test_size=0.2, random_state=42)

'''Instantiate Random Forest Model'''
rf_model_tv = RandomForestClassifier()

'''Train and Fit model'''
rf_model_tv.fit(X_train, y_train)

'''Get predictions on test set'''
predictions = rf_model_tv.predict(X_test)

'''Get feature importance'''
importances_mean = rf_model_tv.feature_importances_.mean()

'''Calculate accuracy score'''
rf_accuracy = accuracy_score(y_test, predictions)
print('Model Performance for Random Forest TF-IDF Vectoriser:')
print('accuracy: ', rf_accuracy)
print('importance', importances_mean)


Model Performance for Random Forest TF-IDF Vectoriser:
accuracy:  0.9722222222222222
importance 2.5716196060278763e-05



|               | Log Regression(CV)| Log Regression(TV)|Random Forest(CV) |Random Forest(TV)|
|---------------|-------------------|-------------------|------------------|-----------------|
|**accuracy**   | 0.972             | 0.982             | 0.967            |0.972            |
|**coef**       |0.0002127          | 0.002106          | -                |-                |
|**importance** |-                  | -                 |2.5716196e-05     |2.57161e-05      |


## Analysis

- The comparison between CountVectorizer and TfidfVectorizer shows how the text representation affects accuracy. CountVectorizer converts the text documents into a matrix of token counts. TfidfVectorizer builds a similar matrix but also weights each token by its relative importance across the entire corpus, so tokens that appear in most documents carry less weight.
- Logistic Regression with TfidfVectorizer is the best-performing baseline model, with an accuracy of 0.982. Its mean coefficient of 0.002106 is around 10 times larger than the mean coefficient of 0.0002127 for Logistic Regression with CountVectorizer. A larger positive coefficient on a token indicates a stronger positive association between that token and the positive label (is_ml = 1). However, the two vectorisers scale their features differently, and a mean over all coefficients averages positive and negative values together, so this figure is only a rough indicator.
- In the table, Random Forest with TfidfVectorizer (0.972) scores slightly higher than Random Forest with CountVectorizer (0.967). The gap is small: RandomForestClassifier is created without a random_state, so its accuracy varies between runs, and the CountVectorizer cell output shown above reports 0.972.
- For the current dataset, TfidfVectorizer results in the higher accuracy overall.

# Hyperparameter Tuning

### Hyperparameter Tuning for Logistic Regression TF-IDF Vectoriser

Hyperparameter tuning will be done using GridSearchCV to identify the best parameters for the Logistic Regression model with TfidfVectorizer.

In [182]:
'''Define hyperparameter grid'''
param_grid_log_tv = {
    'C': [0.1, 1.0, 10.0],
    'penalty': ['none', 'l2']  # the default lbfgs solver supports only l2 or no penalty; newer scikit-learn versions expect None instead of 'none'
}

'''loop through cv from 2 to 5 to find the parameters with the best score'''
for num in range(2,6):
    '''Instantiate logistic regression model and GridSearchCV'''
    model = LogisticRegression()
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid_log_tv, cv=num)

    '''fit GridSearchCv to the model'''
    grid_search.fit(X_train, y_train)

    '''print best parameters and best score'''
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print('cv:',num)
    print("Best Parameters:", best_params)
    print("Best Score:", best_score)
    print('-'*5)



cv: 2
Best Parameters: {'C': 10.0, 'penalty': 'l2'}
Best Score: 0.9512946278545024
-----
cv: 3
Best Parameters: {'C': 10.0, 'penalty': 'l2'}
Best Score: 0.9538266919671093
-----
cv: 4
Best Parameters: {'C': 10.0, 'penalty': 'l2'}
Best Score: 0.9531997187060478
-----
cv: 5
Best Parameters: {'C': 10.0, 'penalty': 'l2'}
Best Score: 0.9557281475861519
-----



| CV            |Best Score for Logistic Regression TF-IDF Vectoriser|  
|---------------|-------------------------------------------------|
|**2**          |0.951                                            | 
|**3**          |0.954                                            | 
|**4**          |0.953                                            | 
|**5**          |0.956                                            |

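The best cross-validated score of 0.956 (cv=5) is lower than the test accuracy of 0.982 achieved by the baseline Logistic Regression model with TfidfVectorizer. Note that best_score_ is the mean accuracy over validation folds of the training data, so it is not measured on the same rows as the baseline's test accuracy. Hence these tuned parameters will not be used for the analysis.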
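
To compare a tuned model with the baseline on equal terms, both can be scored on the same held-out test set. The sketch below does this on synthetic data from `make_classification` (an assumption, not the notebook's data). The grid is restricted to `C` and `max_iter=1000` is set only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical).
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

# Mean accuracy over the validation folds of the training data.
cv_score = grid.best_score_
# Accuracy of the refitted best model on the untouched test rows.
tuned_test_score = grid.score(X_te, y_te)
# Baseline with default parameters, scored on the same test rows.
baseline_test_score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

print("best params      :", grid.best_params_)
print("cv score         :", round(cv_score, 3))
print("tuned test score :", round(tuned_test_score, 3))
print("baseline test    :", round(baseline_test_score, 3))
```

Because `GridSearchCV` refits the best estimator on the full training set by default, `grid.score` and `grid.predict` can be used like any fitted model.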

### Hyperparameter Tuning for Random Forest TF-IDF Vectoriser

In [183]:
'''Train Test Split data for modelling'''
X_train, X_test, y_train, y_test = train_test_split(X, y_extracted, test_size=0.2, random_state=42)

'''Define hyperparameter grid'''
param_grid_rf_tv = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 5, 10],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node
}

'''loop through cv from 2 to 5 to find the parameters with the best score'''
for num in range(2,6):
    '''Instantiate random forest model and GridSearchCV'''
    model = RandomForestClassifier()
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid_rf_tv, cv=num)

    '''fit GridSearchCv to the model'''
    grid_search.fit(X_train, y_train)

    '''print best parameters and best score'''
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print('cv:',num)
    print("Best Parameters:", best_params)
    print("Best Score:", best_score)
    print('-'*5)


cv: 2
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best Score: 0.9462353374193858
-----
cv: 3
Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300}
Best Score: 0.9487666034155599
-----
cv: 4
Best Parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}
Best Score: 0.9500319652218387
-----
cv: 5
Best Parameters: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
Best Score: 0.9512957712734098
-----


| CV            |Best Score for Random Forest TF-IDF Vectoriser   |  
|---------------|-------------------------------------------------|
|**2**          |0.946                                            | 
|**3**          |0.949                                            | 
|**4**          |0.950                                            | 
|**5**          |0.951                                            |

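The best cross-validated score of 0.951 (cv=5) is lower than the test accuracy of 0.972 achieved by the baseline Random Forest model with TfidfVectorizer. As with Logistic Regression, best_score_ is the mean accuracy over validation folds of the training data rather than accuracy on the test set. Hence these tuned parameters will not be used for the analysis.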
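
Because RandomForestClassifier is created without a `random_state` in this notebook, both baseline and tuned scores can shift slightly from run to run, which makes small gaps hard to compare. A minimal sketch of pinning the seed is given below, on synthetic data. The helper `rf_accuracy` is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical).
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

def rf_accuracy(seed):
    """Fit a random forest with a fixed seed and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Same seed, same data: the two runs give the same accuracy.
acc_a = rf_accuracy(0)
acc_b = rf_accuracy(0)
print(acc_a, acc_b)

# Different seeds show how much the score moves through randomness alone.
spread = [rf_accuracy(seed) for seed in range(5)]
print(min(spread), max(spread))
```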

# (M2) Modelling | Using tokenize_join_comments and author

Modelling will be done using tokenize_join_comments and author to understand if author has an impact on the prediction result.

## Preprocessing

Earlier in the EDA process, it was observed that the author column has a small number of missing values. As the objective of that analysis was the joined comments, these rows were not removed. For the purpose of this analysis, rows with a missing author will be removed. The impact is deemed to be negligible as only 4 rows will be removed.

In [184]:
'''creates a copy of the dataframes before preprocessing'''
stats_ml_df_auth_del = stats_ml_df.copy()

'''remove rows from where column 'author' is null '''
stats_ml_df_auth_del = stats_ml_df_auth_del.dropna(subset=['author'])

In [187]:
'''check that rows where author is null have been removed'''
stats_ml_df_auth_del.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1973 entries, 0 to 1976
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   title                   1973 non-null   object             
 1   sub_text                1818 non-null   object             
 2   id                      1973 non-null   object             
 3   author                  1973 non-null   object             
 4   score                   1973 non-null   int64              
 5   upvote_ratio            1973 non-null   float64            
 6   comments_list           1973 non-null   object             
 7   datetime                1973 non-null   datetime64[ns, UTC]
 8   year                    1973 non-null   int64              
 9   month                   1973 non-null   int64              
 10  joined_data             1973 non-null   object             
 11  tokenize_join_comments  1973 non-null   obj

- Validated that the rows where author is null have been removed.
- Observed that the remaining rows still include rows where sub_text is null (1818 non-null values out of 1973 rows).
- For the purpose of this analysis, the decision is made not to remove the rows with a missing value for 'sub_text', as the number of rows to be removed would be significant. The column of focus is tokenize_join_comments, hence it is deemed acceptable to retain these rows.

In [217]:
'''Instantiate X and y for prediction'''
X_extracted2 = stats_ml_df_auth_del['tokenize_join_comments']
y_extracted2 = stats_ml_df_auth_del['is_ml']

## Count Vectoriser Analysis

In [218]:
'''Create a CountVectorizer'''
vectorizer_cv2 = CountVectorizer()

'''Convert text into numerical features'''
X2 = vectorizer_cv2.fit_transform(X_extracted2)

### Logistic Regression - Count Vectoriser

In [227]:
'''Train Test Split data for modelling'''
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y_extracted2, test_size=0.2, random_state=42)

'''Instantiate Logistic Regression Model'''
log_reg_model_cv2 = LogisticRegression()

'''Train and Fit model'''
log_reg_model_cv2.fit(X_train2, y_train2)

'''Get predictions on test set'''
predictions2 = log_reg_model_cv2.predict(X_test2)

'''Calculate accuracy score'''
log_accuracy2 = accuracy_score(y_test2, predictions2)

'''Calculate mean coefficient'''
coefficients_mean2 = log_reg_model_cv2.coef_.mean()

print('Model Performance for Logistic Regression Count Vectoriser:')
print('accuracy: ', round(log_accuracy2,3))
print('Coefficient Mean: ', coefficients_mean2)


Model Performance for Logistic Regression Count Vectoriser:
accuracy:  0.954
Coefficient Mean:  0.002219625576809319


### Random Forest- Count Vectoriser

In [228]:
'''Train Test Split data for modelling'''
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y_extracted2, test_size=0.2, random_state=42)

'''Instantiate Random Forest Model'''
rf_model_cv2 = RandomForestClassifier()

'''Train and Fit model'''
rf_model_cv2.fit(X_train2, y_train2)

'''Get predictions on test set'''
predictions2 = rf_model_cv2.predict(X_test2)

'''Get feature importance'''
importances_mean2 = rf_model_cv2.feature_importances_.mean()

'''Calculate accuracy score'''
rf_accuracy2 = accuracy_score(y_test2, predictions2)
print('Model Performance for Random Forest  Count Vectoriser:')
print('accuracy: ', rf_accuracy2)
print('importance', importances_mean2)


Model Performance for Random Forest  Count Vectoriser:
accuracy:  0.9417721518987342
importance 2.5730091341824264e-05


## TF-IDF Vectoriser Analysis

In [229]:
'''Create a TfidfVectorizer'''
vectorizer_tv2 = TfidfVectorizer()

'''Convert text into numerical features'''
X2 = vectorizer_tv2.fit_transform(X_extracted2)

### Logistic Regression - TF-IDF Vectoriser

In [230]:
'''Train Test Split data for modelling'''
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y_extracted2, test_size=0.2, random_state=42)

'''Instantiate Logistic Regression Model'''
log_reg_model_tv2 = LogisticRegression()

'''Train and Fit model'''
log_reg_model_tv2.fit(X_train2, y_train2)

'''Get predictions on test set'''
predictions2 = log_reg_model_tv2.predict(X_test2)

'''Calculate accuracy score'''
log_accuracy2 = accuracy_score(y_test2, predictions2)

'''Calculate mean coefficient'''
coefficients_mean2 = log_reg_model_tv2.coef_.mean()

print('Model Performance for Logistic Regression TF-IDF Vectoriser:')
print('accuracy: ', log_accuracy2)
print('Coefficient Mean: ', coefficients_mean2)

Model Performance for Logistic Regression TF-IDF Vectoriser:
accuracy:  0.9544303797468354
Coefficient Mean:  0.002219625576809319


### Random Forest - TF-IDF Vectoriser

In [224]:
'''Train Test Split data for modelling'''
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y_extracted2, test_size=0.2, random_state=42)

'''Instantiate Random Forest Model'''
rf_model_tv2 = RandomForestClassifier()

'''Train and Fit model'''
rf_model_tv2.fit(X_train2, y_train2)

'''Get predictions on test set'''
predictions2 = rf_model_tv2.predict(X_test2)

'''Get feature importance'''
importances_mean2 = rf_model_tv2.feature_importances_.mean()

'''Calculate accuracy score'''
rf_accuracy2 = accuracy_score(y_test2, predictions2)
print('Model Performance for Random Forest TF-IDF Vectoriser:')
print('accuracy: ', rf_accuracy2)
print('importance', importances_mean2)

Model Performance for Random Forest TF-IDF Vectoriser:
accuracy:  0.9493670886075949
importance 2.5730091341824264e-05



|               | Log Regression(CV)| Log Regression(TV)|Random Forest(CV) |Random Forest(TV)|
|---------------|-------------------|-------------------|------------------|-----------------|
|**accuracy**   |0.954              | 0.954             | 0.942            |0.949            |
|**coef**       |0.0022196          | 0.0022196         | -                |-                |
|**importance** |-                  | -                 |2.57300916e-05    |2.573009e-05     |

## Analysis

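Between CountVectorizer and TfidfVectorizer, the TfidfVectorizer model performs better for Random Forest (0.949 against 0.942), while for Logistic Regression the accuracy is the same at 0.954. The two Logistic Regression cells also print an identical mean coefficient (0.0022196). Since X2 is reused for both vectorisers, this may mean that both cells were run on the same matrix, so the cells should be re-run in order before reading much into the tie.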

## Overall Analysis

The table below shows the best-performing model using only the tokenised joined comments (M1) and using both the tokenised joined comments and author (M2).

|               | Log Regression(TV) - M1 | Log Regression(TV) - M2|
|---------------|-------------------------|------------------------|
|**accuracy**   |0.982                    | 0.954                  | 

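The best-performing model is Log Regression(TV) - M1, which has the higher accuracy of 0.982, while accuracy drops to 0.954 in M2. Note, however, that in the M2 cells above X_extracted2 still contains only tokenize_join_comments: the author column is used to drop the 4 rows with a missing author and is not passed to the vectoriser or the model. The drop therefore cannot be attributed to the author column itself, as removing rows changes which posts fall into the train and test sets under the same random_state. Whether author adds signal or variability, for example because different authors post to the machine learning and statistics subreddits respectively, remains to be tested by including it as a feature.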
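
The M2 cells above fit only on `tokenize_join_comments`. A minimal sketch of how the author column could be included as a feature is shown below, assuming a `ColumnTransformer` that combines TF-IDF text features with a one-hot encoding of the author. The toy dataframe and its author names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for stats_ml_df_auth_del (hypothetical rows and authors).
toy_df = pd.DataFrame({
    "tokenize_join_comments": [
        "neural network training gpu", "deep learning model github",
        "transformer model code", "gpt openai model training",
        "hypothesis test value", "linear regression variable mean",
        "confidence interval sample mean", "anova variance hypothesis",
    ],
    "author": ["ann", "bob", "ann", "cy", "dee", "eve", "dee", "fay"],
    "is_ml": [1, 1, 1, 1, 0, 0, 0, 0],
})

features = ColumnTransformer([
    # A plain string selects one column as 1D input, as text vectorisers expect.
    ("text", TfidfVectorizer(), "tokenize_join_comments"),
    # A list selects a 2D frame; unseen authors at predict time are ignored.
    ("author", OneHotEncoder(handle_unknown="ignore"), ["author"]),
])

model = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression()),
])
model.fit(toy_df[["tokenize_join_comments", "author"]], toy_df["is_ml"])

fitted = model.named_steps["features"]
n_text = len(fitted.named_transformers_["text"].vocabulary_)
n_author = len(fitted.named_transformers_["author"].categories_[0])
n_coef = model.named_steps["classifier"].coef_.shape[1]
print(n_text, n_author, n_coef)
```

The classifier then has one coefficient per token plus one per distinct author, so the contribution of author can be inspected or compared against a text-only model on the same split.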

# Conclusion

The initial exploratory analysis using CountVectorizer revealed interesting insights on both the machine learning subreddit and the statistics subreddit. The insights identified are as follows:
- 'data' and 'model' are words which are commonly used in both subreddits. 'data' has a higher count in the statistics subreddit and 'model' has the higher count in the machine learning subreddit.
- 'http' is a common term found within both subreddits, but the machine learning subreddit has more posts containing 'http', indicating more references to URLs.
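
The kind of per-subreddit token comparison described above can be sketched with a plain counter. The posts below are made up for illustration and are not the notebook's data, so the printed counts do not reflect the actual findings.

```python
from collections import Counter

# Made-up tokenised posts: (text, is_ml) with 1 = machine learning, 0 = statistics.
posts = [
    ("model data code http", 1),
    ("model github http", 1),
    ("model training data", 1),
    ("data mean variable", 0),
    ("data hypothesis model", 0),
    ("data linear data", 0),
]

# One token counter per class.
counts_by_class = {0: Counter(), 1: Counter()}
for text, is_ml in posts:
    counts_by_class[is_ml].update(text.split())

# Most frequent tokens in each subreddit.
print("statistics      :", counts_by_class[0].most_common(3))
print("machine learning:", counts_by_class[1].most_common(3))

# Tokens that occur in both subreddits, with their counts side by side.
shared = sorted(set(counts_by_class[0]) & set(counts_by_class[1]))
for token in shared:
    print(token, counts_by_class[0][token], counts_by_class[1][token])
```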

Discussion in the statistics subreddit revolves around statistical concepts and methods, with references to statistical terms such as 'mean', 'variable', 'statistics', 'hypothesis' and 'linear'.

In the machine learning subreddit, coding is more closely related to the community, with terms such as 'github' and 'code'. Words such as 'chatgpt', 'gpt', 'openai' and 'ai' indicate that artificial intelligence is a topic of interest within the community itself.

The machine learning community is also more active with more posts per month.

Log Regression(TV) - M1, which analyses only the joined tokenised comments, reaches an accuracy of 0.982. This indicates that the discussion itself (title, sub_text and comments) contains sufficiently distinctive features to tell posts from the two subreddits apart.

## Recommendation

The current insights are useful from the perspective of the primary stakeholders, who could be people interested in moving into the machine learning field as data scientists.
- One insight from the posts is that coding appears to be a relevant skill set.
- Another insight is that Machine Learning and Statistics overlap in certain discussions, yet the subreddit a post belongs to can still be predicted accurately. This implies that, despite the perceived similarities, there are differences between the posts on the two topics.

## Next Steps

1) Overcome the limit of around 1000 posts retrieved, as future library improvements allow:
- The limit of around 1000 posts could affect the accuracy of the analysis, such as the trend of author activity within the same time period. Given that the machine learning subreddit is more active, 1000 posts may cover only 2 months, compared to statistics, where they may cover 3 months.

2) Explore other models, such as KMeans, to identify relationships between clusters of posts.

3) Explore the dataset in different ways, such as:
- segregating the comments and replies into different levels for analysis.
- exploring the content of the reference links that are shared to find common topics of interest.
- exploring whether community members are active in both subreddits or only in the statistics or machine learning subreddit respectively.