# Week07-TextMining

## Introduction:

This notebook uses machine learning models to predict the gender of Twitter account holders from their profile descriptions. The data comes from the gender classifier dataset published on data.world: https://data.world/crowdflower/gender-classifier-data. The dataset contains 20,050 rows and 26 columns describing each Twitter account, including the user name, a random tweet, the account profile description, and the location.

The objective of this project is to build a predictive model that identifies the gender label of a Twitter account from its profile description. To determine which machine learning method works best for this task, we compare Random Forest, Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, AdaBoost, and XGBoost.

First, we clean the dataset and lemmatize the profile descriptions. The data is then divided into training and testing sets, vectorized with TF-IDF, and reduced in dimensionality with Singular Value Decomposition (SVD). Next, each model is trained, evaluated, and tuned with cross-validated grid search. Finally, the models are compared by their accuracy on the test set.

By the end of this notebook, our goal is to have a model that predicts the gender label of Twitter users from their profile descriptions, for the use cases listed in the final discussion.

### Import common packages

In [2]:
import pandas as pd
import numpy as np

# Seed NumPy's global random generator for reproducibility
np.random.seed(1)

### Load data

In [3]:
df = pd.read_csv('/Users/sunilinus/Downloads/text_mining_data.csv', encoding='latin1')
df.head()

| | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | gender | gender:confidence | profile_yn | profile_yn:confidence | created | ... | profileimage | retweet_count | sidebar_color | text | tweet_coord | tweet_count | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 815719226 | False | finalized | 3 | 10/26/15 23:24 | male | 1.0 | yes | 1.0 | 12/5/13 1:48 | ... | https://pbs.twimg.com/profile_images/414342229... | 0 | FFFFFF | Robbie E Responds To Critics After Win Against... |  | 110964 | 10/26/15 12:40 | 6.5873e+17 | main; @Kan1shk3 | Chennai |
| 1 | 815719227 | False | finalized | 3 | 10/26/15 23:30 | male | 1.0 | yes | 1.0 | 10/1/12 13:51 | ... | https://pbs.twimg.com/profile_images/539604221... | 0 | C0DEED | ÛÏIt felt like they were my friends and I was... |  | 7471 | 10/26/15 12:40 | 6.5873e+17 |  | Eastern Time (US & Canada) |
| 2 | 815719228 | False | finalized | 3 | 10/26/15 23:33 | male | 0.6625 | yes | 1.0 | 11/28/14 11:30 | ... | https://pbs.twimg.com/profile_images/657330418... | 1 | C0DEED | i absolutely adore when louis starts the songs... |  | 5617 | 10/26/15 12:40 | 6.5873e+17 | clcncl | Belgrade |
| 3 | 815719229 | False | finalized | 3 | 10/26/15 23:10 | male | 1.0 | yes | 1.0 | 6/11/09 22:39 | ... | https://pbs.twimg.com/profile_images/259703936... | 0 | C0DEED | Hi @JordanSpieth - Looking at the url - do you... |  | 1693 | 10/26/15 12:40 | 6.5873e+17 | Palo Alto, CA | Pacific Time (US & Canada) |
| 4 | 815719230 | False | finalized | 3 | 10/27/15 1:15 | female | 1.0 | yes | 1.0 | 4/16/14 13:23 | ... | https://pbs.twimg.com/profile_images/564094871... | 0 | 0 | Watching Neighbours on Sky+ catching up with t... |  | 31462 | 10/26/15 12:40 | 6.5873e+17 |  |  |


In [4]:
df.shape

(20050, 26)

In [5]:
# Check the column names
df.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'gender', 'gender:confidence', 'profile_yn',
       'profile_yn:confidence', 'created', 'description', 'fav_number',
       'gender_gold', 'link_color', 'name', 'profile_yn_gold', 'profileimage',
       'retweet_count', 'sidebar_color', 'text', 'tweet_coord', 'tweet_count',
       'tweet_created', 'tweet_id', 'tweet_location', 'user_timezone'],
      dtype='object')

### Dropping unnecessary columns

Most columns are not directly related to predicting gender from profile descriptions, and keeping them would add unnecessary complexity to the analysis.

By focusing only on the 'gender' and 'description' columns, the study can concentrate on the relationship between the content of a profile description and the account holder's gender. This simplifies the task and makes the outcomes easier to interpret.

We therefore drop every column except 'gender' and 'description'.

In [6]:
df = df.drop(columns=['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'gender:confidence', 'profile_yn',
       'profile_yn:confidence', 'created', 'fav_number',
       'gender_gold', 'link_color', 'name', 'profile_yn_gold', 'profileimage',
       'retweet_count', 'sidebar_color', 'text', 'tweet_coord', 'tweet_count',
       'tweet_created', 'tweet_id', 'tweet_location', 'user_timezone'])

In [7]:
# Check for missing values
df.isnull().sum()

gender           97
description    3744
dtype: int64

We remove rows containing null values because missing text in the 'description' column is difficult to impute. Unlike numerical or categorical data, there is no straightforward way to fill in missing text while preserving the integrity and context of the data. About 19% of the rows (3,744 of 20,050) have no description and 97 have no gender label, so removing them shrinks the dataset noticeably, but the remaining records are still enough to train a model.

In [8]:
# Drop rows with missing gender or description values
df = df.dropna(subset=['gender', 'description'])

In [9]:
df['gender'].unique()

array(['male', 'female', 'brand', 'unknown'], dtype=object)

We also drop the records where the gender is 'unknown'. The model's objective is to predict the gender label from the profile description, so these records carry no usable target information. Including them would add noise and could reduce the prediction performance of the model.

In [10]:
# Drop records where 'gender' is 'unknown'
df = df.loc[df['gender'] != 'unknown']

In [11]:
# Calculate the imbalance in the dataset
gender_counts = df['gender'].value_counts()
total_samples = len(df)

# Calculate the proportion of each gender class
imbalance = gender_counts / total_samples

print("Imbalance in the dataset:")
print(imbalance)

Imbalance in the dataset:
female    0.368831
male      0.352339
brand     0.278830
Name: gender, dtype: float64


The three classes are reasonably balanced (roughly 37%, 35%, and 28%), so no resampling or class weighting is applied.
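For context when reading the accuracies later in this notebook, the class proportions printed above also give a simple reference point: a classifier that always predicts the most frequent class. The sketch below only reuses the printed proportions; `class_share` and `majority_baseline` are names introduced here for illustration, and the proportions in the test split may differ slightly.

```python
# Class proportions copied from the output above.
class_share = {"female": 0.368831, "male": 0.352339, "brand": 0.278830}

# A model that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = max(class_share, key=class_share.get)
majority_baseline = class_share[majority_class]

print(f"Always predicting '{majority_class}' gives accuracy {majority_baseline:.4f}")
```

Any model worth keeping should clearly beat this reference accuracy.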

Downloading the NLTK resources needed to run lemmatization:

In [10]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/sunilinus/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sunilinus/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

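# Lemmatization function
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    # Split the description into word tokens
    words = nltk.word_tokenize(text)
    # pos='v' lemmatizes every token as a verb (e.g. 'running' -> 'run');
    # tokens of other parts of speech may be left unchanged
    lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
    # Re-join the tokens into one string for the vectorizer
    return ' '.join(lemmatized_words)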

In [13]:
# Apply lemmatization to 'description' column
df['description'] = df['description'].apply(lemmatize_text)

## Assign the input variable to X and the target variable to y

In [14]:
X = df['description']

This is a multi-class classification problem. We predict one of three categories:<br>
whether a Twitter account belongs to a 'male', a 'female', or a 'brand'.

In [15]:
y = df['gender']
y.unique()

array(['male', 'female', 'brand'], dtype=object)

In [16]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['brand' 'female' 'male']


array([2, 2, 2, ..., 2, 1, 1])
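The integer codes above are easier to read later (for example in confusion matrices) if the mapping back to class names is kept in mind. The following is a minimal sketch on toy labels, not the notebook's data; `toy_labels`, `encoder`, and `mapping` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['gender'] (illustrative only).
toy_labels = np.array(["male", "female", "brand", "male"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# LabelEncoder sorts classes alphabetically, so brand=0, female=1, male=2.
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}

# inverse_transform turns integer codes back into the original labels.
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(decoded)
```

In the notebook itself, `le.inverse_transform(...)` would play the same role for model predictions.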

## Split the data

Given the size of the data, splitting it into 70% train and 30% test.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [18]:
X_train.shape, y_train.shape

((10865,), (10865,))

In [19]:
X_test.shape, y_test.shape

((4657,), (4657,))
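The split above passes no `random_state`, so the exact rows in each part (and the accuracies that follow) can change from run to run. A hedged sketch of a reproducible, stratified variant is shown below on synthetic data; the `random_state=42` value and the `toy_*` names are assumptions made for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the descriptions and encoded labels (assumed data).
toy_X = np.array([f"profile {i}" for i in range(30)])
toy_y = np.array([0, 1, 2] * 10)

# random_state fixes the shuffle; stratify keeps the class shares
# the same in the train and test parts.
tr_X, te_X, tr_y, te_y = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print(np.bincount(tr_y), np.bincount(te_y))
```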

In [20]:
X_train.head(5)

13140    Writer of chapter 34 Our Encounters with Madne...
10932    just a Puerto Rican thats thicker than a bowl ...
1416              Keep your heel , head and standards high
16483                                    Forex , Marketing
11496                       â¬Ä_Ä â£ÄÄâøâ¢Ä_Ä Ä_
Name: description, dtype: object

In [21]:
y_train[:5]

array([1, 1, 1, 0, 0])

## Sklearn: Text preparation

In the below code, TfidfVectorizer was used to convert text data from the 'description' column into numerical features that can be used for machine learning models.

In [22]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string avoids invalid-escape warnings; the pattern keeps runs of letters only
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

In [23]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be built from a different vocabulary and would not match the train set.

X_test = tfidf_vect.transform(X_test)


In [24]:
X_train.shape, X_test.shape

((10865, 24059), (4657, 24059))
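The matching column counts above follow from fitting the vectorizer on the training text only and reusing it for the test text. The sketch below illustrates this on two toy corpora (assumed data, not taken from the dataset): words that appear only in the test text are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora standing in for the train and test descriptions (assumed data).
toy_train = ["love music and coffee", "official news account", "coffee lover"]
toy_test = ["music news", "brand new espresso bar"]

vect = TfidfVectorizer(stop_words="english", lowercase=True, token_pattern=r"[^\W\d_]+")

# The vocabulary is learned from the training text only.
toy_train_tfidf = vect.fit_transform(toy_train)
# transform() reuses that vocabulary; unseen test words are ignored.
toy_test_tfidf = vect.transform(toy_test)

print(sorted(vect.vocabulary_))
print(toy_train_tfidf.shape, toy_test_tfidf.shape)
```

The second test sentence shares no words with the training vocabulary, so its row contains only zeros.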

In [25]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<10865x24059 sparse matrix of type '<class 'numpy.float64'>'
	with 97578 stored elements in Compressed Sparse Row format>

In [26]:
# Convert to a dense array to inspect the values (this can use a lot of memory for large matrices)
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Latent Semantic Analysis (Singular Value Decomposition)

Truncated SVD is applied to the TF-IDF matrix to reduce the dimensionality of the text features from the 'description' column. It compresses the 24,059 sparse term features into 300 dense components that capture the most significant patterns, which are then used to train the machine learning models.

In [27]:
from sklearn.decomposition import TruncatedSVD

# n_components is the number of topics, which should be less than the number of features
svd = TruncatedSVD(n_components=300, n_iter=10)

X_train = svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [28]:
X_train.shape, X_test.shape

((10865, 300), (4657, 300))
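One way to judge whether a given number of components is reasonable is to look at how much variance the components retain. The sketch below does this on a small random matrix (assumed data, with 5 components instead of 300); on the notebook's fitted `svd` object the same `explained_variance_ratio_` attribute could be inspected.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random non-negative matrix standing in for the TF-IDF features (assumed data).
rng = np.random.default_rng(0)
toy_features = rng.random((50, 20))

toy_svd = TruncatedSVD(n_components=5, n_iter=10, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_features)

# Share of the original variance kept by the 5 components.
retained = toy_svd.explained_variance_ratio_.sum()
print(toy_reduced.shape, f"variance retained: {retained:.3f}")
```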

Now the dataset is preprocessed and ready to train and evaluate various machine learning models.

## Random Forest

In [29]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [30]:
from sklearn.metrics import accuracy_score

In [31]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train_rf = rnd_clf.predict(X_train)
train_acc_rf = accuracy_score(y_train, y_pred_train_rf)
print(f"Train acc (Random Forest): {accuracy_score(y_train, y_pred_train_rf):.4f}")

Train acc (Random Forest): 0.5566


In [32]:
# Test accuracy
y_pred_test_rf = rnd_clf.predict(X_test)
test_acc_rf = accuracy_score(y_test, y_pred_test_rf)
print(f"Test acc (Random Forest): {test_acc_rf:.4f}")

Test acc (Random Forest): 0.5353


In [33]:
from sklearn.metrics import confusion_matrix

confusion_matrix_rf = confusion_matrix(y_test, y_pred_test_rf)
print("Confusion Matrix (Random Forest):")
print(confusion_matrix_rf)

Confusion Matrix (Random Forest):
[[ 652  336  303]
 [ 116 1237  375]
 [ 161  873  604]]
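The confusion matrix above holds more detail than the single accuracy figure. As a worked example, the sketch below recomputes the test accuracy from it and derives per-class recall and precision; rows are true classes and columns are predicted classes, in the encoder's order (brand, female, male).

```python
import numpy as np

# Test-set confusion matrix printed above (rows = true class, columns = predicted).
# Class order follows the LabelEncoder output: brand, female, male.
cm = np.array([[652, 336, 303],
               [116, 1237, 375],
               [161, 873, 604]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of each true class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of each class

for name, r, p in zip(["brand", "female", "male"], recall, precision):
    print(f"{name:>6}: recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The model recovers 'female' accounts far more often than 'male' ones, largely because many 'male' accounts are predicted as 'female' (873 of 1,638).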


### Hyperparameter Tuning on Random Forest:

n_estimators: The number of trees in the forest. We will test values of 50, 100, and 150 to see which number of trees provides the best performance without overfitting or underfitting.

max_depth: The maximum depth of the trees. We will test values of 5, 10, and 15 to control the complexity of the model and avoid overfitting.

min_samples_split: The minimum number of samples required to split an internal node. We will test values of 2, 5, and 10 to determine the optimal value for splitting nodes.

min_samples_leaf: The minimum number of samples required to be at a leaf node. We will test values of 1, 2, and 4 to control overfitting by setting a minimum threshold for samples at leaf nodes.

max_features: The number of features to consider when looking for the best split. We will test values of 'sqrt', 'log2', and None (considering all features) to determine the optimal number of features to consider.

In [34]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Create the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Evaluate the best model on the test set
best_rf_model = grid_search.best_estimator_
y_pred_test_best = best_rf_model.predict(X_test)
test_acc_best = accuracy_score(y_test, y_pred_test_best)
print(f"Test acc (Best Random Forest): {test_acc_best:.4f}")

Best Hyperparameters:
{'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 150}
Test acc (Best Random Forest): 0.5832
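An exhaustive grid search grows quickly with the number of parameters. As a rough sketch of the cost of the search above, scikit-learn's `ParameterGrid` can count the candidate combinations; `rf_param_grid` repeats the grid from the cell above under a new name.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
rf_param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}

n_candidates = len(ParameterGrid(rf_param_grid))  # 3 * 3 * 3 * 3 * 3
n_fits = n_candidates * 3                         # cv=3 folds per candidate

print(f"{n_candidates} candidates, {n_fits} fits")
```

With `refit=True` (the default), `GridSearchCV` also trains the best candidate once more on the full training set.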


## Logistic Regression

In [35]:
from sklearn.linear_model import LogisticRegression

# Create and train the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
_ = log_reg.fit(X_train, y_train)

### Evaluating Model Performance

In [36]:
y_pred_train_lr = log_reg.predict(X_train)
train_acc_lr = accuracy_score(y_train, y_pred_train_lr)
print(f"Train acc (Logistic Regression): {train_acc_lr:.4f}")

Train acc (Logistic Regression): 0.6075


In [37]:
y_pred_test_lr = log_reg.predict(X_test)
test_acc_lr = accuracy_score(y_test, y_pred_test_lr)
print(f"Test acc (Logistic Regression): {test_acc_lr:.4f}")

Test acc (Logistic Regression): 0.5905


In [38]:
# Confusion Matrix for Logistic Regression
confusion_matrix_lr = confusion_matrix(y_test, y_pred_test_lr)
print("Confusion Matrix (Logistic Regression):")
print(confusion_matrix_lr)

Confusion Matrix (Logistic Regression):
[[ 775  257  259]
 [ 128 1168  432]
 [ 171  660  807]]


### Hyperparameter tuning on Logistic Regression:

C: Inverse of regularization strength. Smaller values specify stronger regularization. We will test values of 0.1, 1, and 10 to find the optimal regularization strength.

penalty: The norm used in the penalization. We will test 'l1' (Lasso) and 'l2' (Ridge) penalties to see which performs better.

In [39]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the parameter grid to search
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2']
}

# Create the Logistic Regression model
log_reg_model = LogisticRegression(random_state=42, max_iter=1000)

# Perform Grid Search to find the best hyperparameters
grid_search_lr = GridSearchCV(estimator=log_reg_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search_lr.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search_lr.best_params_)

# Evaluate the best model on the test set
best_lr_model = grid_search_lr.best_estimator_
y_pred_test_lr_best = best_lr_model.predict(X_test)
test_acc_lr_best = accuracy_score(y_test, y_pred_test_lr_best)
print(f"Test acc (Best Logistic Regression): {test_acc_lr_best:.4f}")

9 fits failed out of a total of 18.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
9 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/sunilinus/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/sunilinus/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sunilinus/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1169, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^

Best Hyperparameters:
{'C': 1, 'penalty': 'l2'}
Test acc (Best Logistic Regression): 0.5905
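The warning above reports 9 failed fits out of 18. The grid has 3 values of C, 2 penalties, and 3 folds, and the 9 failures correspond to the 'l1' candidates: the default `lbfgs` solver of `LogisticRegression` does not support an L1 penalty, so only the 'l2' candidates were actually compared. A possible fix, sketched below on synthetic data, is to choose a solver that supports both penalties, such as `saga`. The data, `max_iter=5000`, and `error_score='raise'` are assumptions for illustration; results on the real features are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the SVD features (assumed data).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

lr_param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}

# 'saga' supports both 'l1' and 'l2'; the default 'lbfgs' does not support 'l1'.
lr = LogisticRegression(solver='saga', max_iter=5000, random_state=42)

# error_score='raise' surfaces any incompatible combination immediately
# instead of silently scoring it as NaN.
search = GridSearchCV(lr, lr_param_grid, cv=3, error_score='raise')
search.fit(toy_X, toy_y)

print(search.best_params_)
```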


## K-Nearest Neighbors (KNN)

In [40]:
from sklearn.neighbors import KNeighborsClassifier

# Create and train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
_ = knn.fit(X_train, y_train)

### Evaluating Model Performance

In [41]:
y_pred_train_knn = knn.predict(X_train)
train_acc_knn = accuracy_score(y_train, y_pred_train_knn)
print(f"Train acc (KNN): {train_acc_knn:.4f}")

Train acc (KNN): 0.6574


In [42]:
y_pred_test_knn = knn.predict(X_test)
test_acc_knn = accuracy_score(y_test, y_pred_test_knn)
print(f"Test acc (KNN): {test_acc_knn:.4f}")

Test acc (KNN): 0.4904


In [43]:
# Confusion Matrix for KNN
confusion_matrix_knn = confusion_matrix(y_test, y_pred_test_knn)
print("Confusion Matrix (KNN):")
print(confusion_matrix_knn)

Confusion Matrix (KNN):
[[709 293 289]
 [316 924 488]
 [362 625 651]]


### Hyperparameter tuning on KNN:

n_neighbors: The number of neighbors to consider. We will test values of 3, 5, and 7 to find the optimal number of neighbors.

weights: The weight function used in prediction. We will test 'uniform' and 'distance' to see which performs better.

p: The power parameter for the Minkowski distance. We will test values of 1 (Manhattan distance) and 2 (Euclidean distance).
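To make the `p` parameter concrete, the sketch below computes the Minkowski distance between two points for p=1 and p=2. The helper `minkowski` is written here for illustration and is not the function scikit-learn uses internally.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

manhattan = minkowski(u, v, p=1)  # |3| + |4|
euclidean = minkowski(u, v, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```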

In [44]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define the parameter grid to search
param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

# Create the KNN model
knn_model = KNeighborsClassifier()

# Perform Grid Search to find the best hyperparameters
grid_search_knn = GridSearchCV(estimator=knn_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search_knn.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search_knn.best_params_)

# Evaluate the best model on the test set
best_knn_model = grid_search_knn.best_estimator_
y_pred_test_knn_best = best_knn_model.predict(X_test)
test_acc_knn_best = accuracy_score(y_test, y_pred_test_knn_best)
print(f"Test acc (Best KNN): {test_acc_knn_best:.4f}")

Best Hyperparameters:
{'n_neighbors': 7, 'p': 2, 'weights': 'distance'}
Test acc (Best KNN): 0.5123


## Support Vector Machine (SVM)

In [45]:
from sklearn.svm import SVC

# Create and train the SVM model
svm = SVC(kernel='linear')
_ = svm.fit(X_train, y_train)

### Evaluating Model Performance

In [46]:
y_pred_train_svm = svm.predict(X_train)
train_acc_svm = accuracy_score(y_train, y_pred_train_svm)
print(f"Train acc (SVM): {train_acc_svm:.4f}")

Train acc (SVM): 0.5983


In [47]:
y_pred_test_svm = svm.predict(X_test)
test_acc_svm = accuracy_score(y_test, y_pred_test_svm)
print(f"Test acc (SVM): {test_acc_svm:.4f}")

Test acc (SVM): 0.5738


In [48]:
# Confusion Matrix for SVM
confusion_matrix_svm = confusion_matrix(y_test, y_pred_test_svm)
print("Confusion Matrix (SVM):")
print(confusion_matrix_svm)

Confusion Matrix (SVM):
[[ 694  314  283]
 [ 112 1243  373]
 [ 145  758  735]]


### Hyperparameter tuning on SVM:

C: Regularization parameter. We will test values of 0.1, 1, and 10 to find the optimal regularization strength.

kernel: Kernel type to be used in the algorithm. We will test 'linear', 'rbf', and 'sigmoid' kernels to find the best kernel for our model.

gamma: Kernel coefficient for 'rbf' and 'sigmoid'. We will test values of 'scale' and 'auto' to determine the optimal gamma value.
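For the two `gamma` options listed above, scikit-learn's `SVC` documentation defines 'auto' as 1 / n_features and 'scale' as 1 / (n_features * X.var()). The sketch below evaluates both on a random matrix with 300 columns (assumed data, not the notebook's features). Note that `gamma` is ignored by the 'linear' kernel, so the grid contains linear candidates that differ only in an unused parameter.

```python
import numpy as np

# Random matrix standing in for the 300 SVD features (assumed data).
rng = np.random.default_rng(0)
toy_X = rng.normal(scale=0.05, size=(100, 300))

n_features = toy_X.shape[1]
gamma_auto = 1.0 / n_features                   # gamma='auto'
gamma_scale = 1.0 / (n_features * toy_X.var())  # gamma='scale'

print(f"auto={gamma_auto:.5f}, scale={gamma_scale:.3f}")
```

When the features have a variance well below 1, as in this toy matrix, 'scale' gives a much larger gamma than 'auto'.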

In [49]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define the parameter grid to search
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Create the SVM model
svm_model = SVC(random_state=42)

# Perform Grid Search to find the best hyperparameters
grid_search_svm = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search_svm.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search_svm.best_params_)

# Evaluate the best model on the test set
best_svm_model = grid_search_svm.best_estimator_
y_pred_test_svm_best = best_svm_model.predict(X_test)
test_acc_svm_best = accuracy_score(y_test, y_pred_test_svm_best)
print(f"Test acc (Best SVM): {test_acc_svm_best:.4f}")

Best Hyperparameters:
{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
Test acc (Best SVM): 0.6032


## Decision Tree

In [50]:
from sklearn.tree import DecisionTreeClassifier

# Create and train the Decision Tree model
dt = DecisionTreeClassifier(max_depth=5)
_ = dt.fit(X_train, y_train)

### Evaluating Model Performance

In [51]:
y_pred_train_dt = dt.predict(X_train)
train_acc_dt = accuracy_score(y_train, y_pred_train_dt)
print(f"Train acc (Decision Tree): {train_acc_dt:.4f}")

Train acc (Decision Tree): 0.5227


In [52]:
y_pred_test_dt = dt.predict(X_test)
test_acc_dt = accuracy_score(y_test, y_pred_test_dt)
print(f"Test acc (Decision Tree): {test_acc_dt:.4f}")

Test acc (Decision Tree): 0.5149


In [53]:
# Confusion Matrix for Decision Tree
confusion_matrix_dt = confusion_matrix(y_test, y_pred_test_dt)
print("Confusion Matrix (Decision Tree):")
print(confusion_matrix_dt)

Confusion Matrix (Decision Tree):
[[683 192 416]
 [175 909 644]
 [236 596 806]]


### Hyperparameter tuning on Decision Tree:

max_depth: The maximum depth of the tree. We will test values of 5, 10, and 15 to control the complexity of the model and avoid overfitting.

min_samples_split: The minimum number of samples required to split an internal node. We will test values of 2, 5, and 10 to determine the optimal value for splitting nodes.

min_samples_leaf: The minimum number of samples required to be at a leaf node. We will test values of 1, 2, and 4 to control overfitting by setting a minimum threshold for samples at leaf nodes.

max_features: The number of features to consider when looking for the best split. We will test values of 'sqrt', 'log2', and None (considering all features) to determine the optimal number of features to consider.

In [54]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the parameter grid to search
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Create the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)

# Perform Grid Search to find the best hyperparameters
grid_search_dt = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search_dt.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search_dt.best_params_)

# Evaluate the best model on the test set
best_dt_model = grid_search_dt.best_estimator_
y_pred_test_dt_best = best_dt_model.predict(X_test)
test_acc_dt_best = accuracy_score(y_test, y_pred_test_dt_best)
print(f"Test acc (Best Decision Tree): {test_acc_dt_best:.4f}")

Best Hyperparameters:
{'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Test acc (Best Decision Tree): 0.5151


## AdaBoost

In [55]:
from sklearn.ensemble import AdaBoostClassifier

# Create and train the AdaBoost model
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
_ = ada.fit(X_train, y_train)

### Evaluating Model Performance

In [56]:
y_pred_train_ada = ada.predict(X_train)
train_acc_ada = accuracy_score(y_train, y_pred_train_ada)
print(f"Train acc (AdaBoost): {train_acc_ada:.4f}")

Train acc (AdaBoost): 0.5482


In [57]:
y_pred_test_ada = ada.predict(X_test)
test_acc_ada = accuracy_score(y_test, y_pred_test_ada)
print(f"Test acc (AdaBoost): {test_acc_ada:.4f}")

Test acc (AdaBoost): 0.5265


In [58]:
# Confusion Matrix for AdaBoost
confusion_matrix_ada = confusion_matrix(y_test, y_pred_test_ada)
print("Confusion Matrix (AdaBoost):")
print(confusion_matrix_ada)

Confusion Matrix (AdaBoost):
[[ 766  230  295]
 [ 237 1015  476]
 [ 306  661  671]]


### Hyperparameter tuning on AdaBoost:

n_estimators: The number of base estimators. We will test values of 50, 100, and 150 to find the optimal number of estimators.

learning_rate: The learning rate shrinks the contribution of each classifier. We will test values of 0.1, 0.5, and 1 to find the optimal learning rate.

In [59]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.1, 0.5, 1]
}

# Create the AdaBoost model
adaboost_model = AdaBoostClassifier(random_state=42)

# Perform Grid Search to find the best hyperparameters
grid_search_adaboost = GridSearchCV(estimator=adaboost_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search_adaboost.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search_adaboost.best_params_)

# Evaluate the best model on the test set
best_adaboost_model = grid_search_adaboost.best_estimator_
y_pred_test_adaboost_best = best_adaboost_model.predict(X_test)
test_acc_adaboost_best = accuracy_score(y_test, y_pred_test_adaboost_best)
print(f"Test acc (Best AdaBoost): {test_acc_adaboost_best:.4f}")

Best Hyperparameters:
{'learning_rate': 0.5, 'n_estimators': 150}
Test acc (Best AdaBoost): 0.5585


## XGBoost

In [60]:
import xgboost as xgb

# Create and train the XGBoost model
xgboost_model = xgb.XGBClassifier(objective='multi:softmax', num_class=3, max_depth=5, n_estimators=100)
_ = xgboost_model.fit(X_train, y_train)

### Evaluating Model Performance

In [61]:
y_pred_train_xgb = xgboost_model.predict(X_train)
train_acc_xgb = accuracy_score(y_train, y_pred_train_xgb)
print(f"Train acc (XGBoost): {train_acc_xgb:.4f}")

Train acc (XGBoost): 0.9586


In [62]:
y_pred_test_xgb = xgboost_model.predict(X_test)
test_acc_xgb = accuracy_score(y_test, y_pred_test_xgb)
print(f"Test acc (XGBoost): {test_acc_xgb:.4f}")

Test acc (XGBoost): 0.5954


In [63]:
# Confusion Matrix for XGBoost
confusion_matrix_xgb = confusion_matrix(y_test, y_pred_test_xgb)
print("Confusion Matrix (XGBoost):")
print(confusion_matrix_xgb)

Confusion Matrix (XGBoost):
[[ 812  216  263]
 [ 151 1071  506]
 [ 184  564  890]]


### Hyperparameter tuning on XGBoost:

n_estimators: The number of boosting rounds. We will test values of 50, 100, and 150 to find the optimal number of estimators.

learning_rate: Step size shrinkage used to prevent overfitting. We will test values of 0.1, 0.5, and 1 to find the optimal learning rate.

max_depth: Maximum depth of a tree. We will test values of 3, 5, and 7 to control the complexity of the model and avoid overfitting.

subsample: Subsample ratio of the training instances. We will test values of 0.6, 0.8, and 1.0 to determine the optimal subsample ratio.

In [64]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.1, 0.5, 1],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0]
}

# Create the XGBoost model
xgb_model = XGBClassifier(random_state=42)

# Perform Grid Search to find the best hyperparameters
grid_search_xgb = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search_xgb.best_params_)

# Evaluate the best model on the test set
best_xgb_model = grid_search_xgb.best_estimator_
y_pred_test_xgb_best = best_xgb_model.predict(X_test)
test_acc_xgb_best = accuracy_score(y_test, y_pred_test_xgb_best)
print(f"Test acc (Best XGBoost): {test_acc_xgb_best:.4f}")

Best Hyperparameters:
{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.8}
Test acc (Best XGBoost): 0.6004


## Analysis and Discussion:

First, let us explain why we chose accuracy as our principal metric:<br>Accuracy is easy to interpret, suits reasonably balanced classes, and allows direct comparison between models, so it was selected as the main metric for assessing how well each model predicts gender from Twitter profile descriptions. Stakeholders can readily understand it because it is simply the proportion of predictions that were correct. Because the three classes occur in similar proportions, accuracy gives a fair overall picture of performance across classes. Other metrics such as precision, recall, F1-score, or AUC-ROC may be better suited to imbalanced datasets or unequal misclassification costs, but accuracy is sufficient for this task.

Random Forest:<br>Before hyperparameter tuning, the Random Forest model showed modest performance, with a test accuracy of 53.53%. After tuning, the accuracy rose to 58.32%, which shows that the model benefited from optimized parameters. Profile descriptions therefore carry some signal about gender that Random Forest can pick up.

Logistic Regression:<br>Logistic Regression yielded a test accuracy of 59.05%, somewhat better than Random Forest. Hyperparameter tuning selected the default settings (C=1 with an L2 penalty), so the test accuracy stayed at 59.05%; the L1 candidates in the grid failed to fit.

K-Nearest Neighbors (KNN):<br>KNN reached a train accuracy of 65.74% but a much lower test accuracy of 49.04%, which suggests overfitting. The test accuracy rose to 51.23% after hyperparameter tuning, but it was still lower than that of the other models. Based on its performance, KNN is probably not a good choice for this text classification task.

Support Vector Machine (SVM):<br>Before hyperparameter tuning, SVM showed reasonable performance, with a test accuracy of 57.38%. After tuning, the accuracy rose to 60.32%, the highest test accuracy in this notebook, suggesting that SVM is well suited to this text classification task.

Decision Tree:<br>Decision Tree performed poorly, achieving 51.49% on the test set. Hyperparameter tuning barely changed the test accuracy (51.51%), indicating that Decision Tree may not be the best model for this particular task.

AdaBoost:<br>AdaBoost achieved a test accuracy of 52.65% before tuning, which was comparable to Decision Tree's performance. After tuning, the accuracy rose to 55.85%, a modest improvement. Based on its performance, AdaBoost might not be the best model for this type of task.

XGBoost:<br>With a train accuracy of 95.86%, XGBoost fit the training data far better than any other model, but its test accuracy of 59.54% shows that much of this is overfitting. After tuning, the test accuracy rose slightly to 60.04%. Based on its test performance, XGBoost is among the stronger models for this task.
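The overfitting remarks above can be made concrete by comparing train and test accuracy for each untuned model. The sketch below copies the accuracies from the cell outputs earlier in this notebook and computes the gap; `scores` and `gaps` are names introduced here for illustration.

```python
# Train and test accuracies of the untuned models, copied from the outputs above.
scores = {
    "Random Forest":       (0.5566, 0.5353),
    "Logistic Regression": (0.6075, 0.5905),
    "KNN":                 (0.6574, 0.4904),
    "SVM":                 (0.5983, 0.5738),
    "Decision Tree":       (0.5227, 0.5149),
    "AdaBoost":            (0.5482, 0.5265),
    "XGBoost":             (0.9586, 0.5954),
}

# A large train-test gap is a common sign of overfitting.
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} gap = {gap:.4f}")
```

XGBoost and KNN show by far the largest gaps, while the remaining models differ by only a few percentage points between train and test.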

### Conclusion: 

With test accuracies of about 60%, Logistic Regression, Support Vector Machine (SVM), and XGBoost showed the most potential for determining gender from Twitter profile descriptions among the models assessed in this study. Accuracy was a suitable performance metric for this task because it offers a transparent indicator of model effectiveness in a multi-class setting with reasonably balanced classes. Beyond predicting the gender of individual Twitter account holders, this approach has other possible use cases. For example:

Social Media Analysis: Social media analytics firms can use this technique to learn more about the demographics of Twitter users. Use of this data for marketing and advertising campaigns may be beneficial.

Targeted Marketing: Based on the gender distribution of their target audience on Twitter, businesses can use this model to more successfully target their marketing initiatives.

Brand Perception Analysis: Businesses can learn more about how different genders view their brand by examining the gender distribution of followers and how they interact with the social media accounts of brands.

### By
#### Sunil Yanagandula<br>Kumara Swamy Padigeri