## TWITTER ACCOUNT CLASSIFICATION
- Aim: use text analysis to classify Twitter accounts
- My approach: focus on text analysis only, using a vanilla approach to set a baseline
- Author: Huyen Nguyen (xuanhuyen3011@github)

### Copyright (c) 2024 Xuan Huyen Nguyen
- This project is part of the CSIT946 subject at UoW, where the work involves processing different types of numeric and text data to classify Twitter accounts. The code is provided for educational purposes and demonstration use only. 
- DISCLAIMER: It comes with no guarantees or warranties, and I, as the author, am not liable for any issues that arise from its use. While you're welcome to learn from it, please do not copy or distribute it for your own coursework or assignments without permission.

### Import necessary libraries

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Load the data into a DataFrame

In [2]:
raw_df = pd.read_csv('twitter_user_data.csv', encoding = "ISO-8859-1")

In [3]:
raw_df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.5873e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.5873e+17,,


In [4]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20050 entries, 0 to 20049
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   _unit_id               20050 non-null  int64  
 1   _golden                20050 non-null  bool   
 2   _unit_state            20050 non-null  object 
 3   _trusted_judgments     20050 non-null  int64  
 4   _last_judgment_at      20000 non-null  object 
 5   gender                 19953 non-null  object 
 6   gender:confidence      20024 non-null  float64
 7   profile_yn             20050 non-null  object 
 8   profile_yn:confidence  20050 non-null  float64
 9   created                20050 non-null  object 
 10  description            16306 non-null  object 
 11  fav_number             20050 non-null  int64  
 12  gender_gold            50 non-null     object 
 13  link_color             20050 non-null  object 
 14  name                   20050 non-null  object 
 15  pr

### Initial analysis

- The `gender` column contains 97 NaN values and 1,117 `unknown` values

In [5]:
gender_table = raw_df['gender'].value_counts(dropna=False)
print(gender_table)

gender
female     6700
male       6194
brand      5942
unknown    1117
NaN          97
Name: count, dtype: int64


### Rationale for my approach of data preprocessing
- We will drop the columns that are not related to the task.
- Because no ground truth is provided for them, rows with a NaN label will be dropped to avoid confusing the model.
- Since we want to distinguish between human and non-human accounts, we will combine the `male` and `female` gender categories into a single human class. A confidence threshold will be determined, and any profile whose confidence does not exceed it will be treated as unlabeled and excluded. The choice of threshold is explained below.


In [6]:
# have a look at the golden dataset
golden_df = raw_df[raw_df['_golden'] == True]


In [7]:
golden_df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
20000,815746503,True,golden,249,,male,0.9612,yes,0.9612,8/5/10 8:31,...,https://pbs.twimg.com/profile_images/643511925...,0,FFFFFF,Reimagining the #webdesign process by @InVisio...,,3874,10/26/15 12:40,6.5873e+17,127.0.0.1,Athens
20001,815750089,True,golden,271,,brand,0.9622,yes,1.0,9/10/14 16:30,...,https://pbs.twimg.com/profile_images/509901605...,0,0,#WestHam Tweets: 52: Goal. @FulhamFC double th...,,24827,10/26/15 13:20,6.5874e+17,,
20002,815750297,True,golden,245,,brand,1.0,yes,1.0,5/11/09 15:31,...,https://pbs.twimg.com/profile_images/532477863...,1,FFFFFF,Webber: 'It's a chance for the lads to pit the...,,42075,10/26/15 12:40,6.5873e+17,"Wembley Stadium, London",London
20003,815750417,True,golden,245,,brand,0.6408,yes,1.0,8/1/14 13:20,...,https://abs.twimg.com/sticky/default_profile_i...,0,C0DEED,Get Weather Updates from The Weather Channel. ...,,63240,10/26/15 12:40,6.5873e+17,,
20004,815750696,True,golden,261,,male,1.0,yes,1.0,3/26/12 14:40,...,https://pbs.twimg.com/profile_images/604379716...,0,0,@TheFalcoholic is like the mailman... Because ...,,3296,10/26/15 12:40,6.5873e+17,Parts Unknown,Pacific Time (US & Canada)


- Inspect the confidence levels in the golden dataset (labeled with human input) to choose the threshold. In the output below, the lowest confidence value is 0.554, so we use this as the threshold. Any profile whose confidence does not exceed this value is treated as unreliable and is not given a label.

In [8]:

# Group by 'gender:confidence' and calculate the percentage of True values in '_golden'
confidence_levels = raw_df.groupby('gender:confidence')['_golden'].mean() * 100

# Find confidence levels with more than 50% True in '_golden'
high_confidence_levels = confidence_levels[confidence_levels > 50]

# Convert the result to a DataFrame for better readability
high_confidence_df = high_confidence_levels.reset_index()
high_confidence_df.columns = ['gender:confidence', 'percentage_true']

print(high_confidence_df)

    gender:confidence  percentage_true
0              0.5540            100.0
1              0.5654            100.0
2              0.5762            100.0
3              0.6059            100.0
4              0.6364            100.0
5              0.7538            100.0
6              0.7650            100.0
7              0.8397            100.0
8              0.8431            100.0
9              0.8458            100.0
10             0.8472            100.0
11             0.8489            100.0
12             0.8503            100.0
13             0.8796            100.0
14             0.9612            100.0
15             0.9613            100.0
16             0.9618            100.0
17             0.9622            100.0
18             0.9636            100.0
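
The grouped table above only lists confidence values whose rows are mostly golden. A more direct check is to take the minimum of `gender:confidence` over the golden rows. The sketch below shows this on a small made-up frame; `toy_df` and its values are assumptions for illustration, not the real dataset, so the same expression would need to be run on `raw_df` to confirm the 0.554 figure.

```python
import pandas as pd

# Toy stand-in for raw_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "_golden": [True, True, True, False, False],
    "gender:confidence": [0.554, 0.9612, 1.0, 0.3411, 1.0],
})

# Keep only golden rows (boolean mask), then take the lowest confidence.
golden_min_conf = toy_df.loc[toy_df["_golden"], "gender:confidence"].min()
print(golden_min_conf)
```

On the real data the equivalent call would be `raw_df.loc[raw_df['_golden'], 'gender:confidence'].min()`.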


- I have chosen to combine the gender categories into a single binary label named 'is_non-human'. A gender value is converted into a label only when its confidence level is above the chosen threshold.
- Rows whose confidence does not exceed this threshold will be assigned a null value for 'is_non-human', indicating that the reference model is uncertain or unable to make a prediction.

### Create a label column 'is_non-human'
- Merge the `female` and `male` categories into label 0 (human); `brand` becomes label 1 (non-human)
- Any row whose confidence is not above the threshold is marked as null

In [9]:
# Set the confidence threshold
confidence_thresh = 0.554

# Exclude golden data set
processed_data = raw_df[raw_df['_golden'] != True].copy()

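# Replace NaN values in 'gender:confidence' with 0
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
processed_data['gender:confidence'] = processed_data['gender:confidence'].fillna(0)

# Replace NaN values in 'gender' with 'unknown'
processed_data['gender'] = processed_data['gender'].fillna('unknown')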



In [10]:
processed_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 0 to 19999
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   _unit_id               20000 non-null  int64  
 1   _golden                20000 non-null  bool   
 2   _unit_state            20000 non-null  object 
 3   _trusted_judgments     20000 non-null  int64  
 4   _last_judgment_at      20000 non-null  object 
 5   gender                 20000 non-null  object 
 6   gender:confidence      20000 non-null  float64
 7   profile_yn             20000 non-null  object 
 8   profile_yn:confidence  20000 non-null  float64
 9   created                20000 non-null  object 
 10  description            16262 non-null  object 
 11  fav_number             20000 non-null  int64  
 12  gender_gold            0 non-null      object 
 13  link_color             20000 non-null  object 
 14  name                   20000 non-null  object 
 15  profile

In [11]:
# Create the 'is_non-human' column and set default values
processed_data['is_non-human'] = np.nan

# Apply conditions to set 'is_non-human'
processed_data.loc[(processed_data['gender'] == 'brand') & (processed_data['gender:confidence'] > confidence_thresh), 'is_non-human'] = 1
processed_data.loc[(processed_data['gender'] == 'male') & (processed_data['gender:confidence'] > confidence_thresh), 'is_non-human'] = 0
processed_data.loc[(processed_data['gender'] == 'female') & (processed_data['gender:confidence'] > confidence_thresh), 'is_non-human'] = 0

# Convert 'is_non-human' to numeric (float) type
processed_data['is_non-human'] = processed_data['is_non-human'].astype('float')

# Remove the 'gender_gold' and 'profile_yn_gold' columns
processed_data.drop(columns=['gender_gold', 'profile_yn_gold'], inplace=True)

# Drop rows where 'is_non-human' is NaN
processed_data.dropna(subset=['is_non-human'], inplace=True)

# Display the counts of 'is_non-human' values (NaN rows were dropped above, so none should appear)
print(processed_data['is_non-human'].value_counts(dropna=False))

is_non-human
0.0    12458
1.0     5719
Name: count, dtype: int64
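
As a more compact alternative to the three `.loc` assignments, the same labeling rule can be sketched with `Series.map` and `Series.where`. This is an illustrative rewrite on made-up rows, not the code that produced the counts above; `toy_df` and `label_map` are hypothetical names.

```python
import pandas as pd

# Toy rows covering each case; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "gender": ["male", "female", "brand", "brand", "unknown"],
    "gender:confidence": [1.0, 0.60, 0.90, 0.40, 1.0],
})
confidence_thresh = 0.554

# Map each category to its label; anything not in the dict (e.g. 'unknown') becomes NaN.
label_map = {"male": 0.0, "female": 0.0, "brand": 1.0}
labels = toy_df["gender"].map(label_map)

# Keep a label only where the confidence is strictly above the threshold.
confident = toy_df["gender:confidence"] > confidence_thresh
toy_df["is_non-human"] = labels.where(confident)

print(toy_df)
```

Here the low-confidence `brand` row and the `unknown` row both end up as NaN, so a later `dropna(subset=['is_non-human'])` would remove them.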


### Split processed_data into a train set and a test set, using stratification to preserve the class proportions

In [17]:
# Split the data into features (X) and target variable (y)
X = processed_data.drop(columns=['is_non-human'])
y = processed_data['is_non-human']

# Perform stratified split into 90/10 train and validation sets (test_size=0.1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)

# If you want to combine the features and labels back into a DataFrame for train and test sets:
train_df = X_train.join(y_train)
test_df = X_val.join(y_val)

# Display the shapes of the train and validation sets
print("Train set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)


Train set shape: (16359, 24) (16359,)
Validation set shape: (1818, 24) (1818,)
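
One way to confirm that `stratify` kept the class balance is to compare the normalized label counts in both splits. The sketch below uses made-up labels with a roughly similar 70/30 imbalance; `X_toy` and `y_toy` are assumptions for illustration, not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: 70 human (0.0) and 30 non-human (1.0); invented for illustration only.
y_toy = pd.Series([0.0] * 70 + [1.0] * 30, name="is_non-human")
X_toy = pd.DataFrame({"row_id": range(len(y_toy))})

# Same settings as the split above: 10% held out, stratified on the label.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=42
)

# Class proportions should be close to identical in both splits.
print(y_tr.value_counts(normalize=True))
print(y_va.value_counts(normalize=True))
```

The same two `value_counts(normalize=True)` calls can be applied to `y_train` and `y_val` from the cell above.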


In [19]:
# Save train and validation data to CSV files
train_df.to_csv('train_data.csv', index=False)
test_df.to_csv('test_data.csv', index=False)
