## __Problem Statement:__

For the given dataset, perform EDA with visualization, formulate 2 questions about the data and answer them. Then proceed to build an ensemble classifier using 3 ML algorithms and find out which algorithm best suits the dataset with respect to accuracy.
#### __Procedure:__

   1. The dataset is to be analysed and preliminary data cleaning is to be done.
   2. Data exploration and feature engineering are to be done to fine-tune the dataset.
   3. ML modelling and accuracy checking to find the optimal algorithm for the dataset.


#### __Questions to be answered at the end of EDA:__
   1. Which words are most commonly used by male and female users?
   2. How significant are the color attributes chosen by the users?



In [None]:
# Importing necessary packages 

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## __Preliminary Data Assessment__

In [None]:
twitter = pd.read_csv('../input/twitter-user-gender-classification/gender-classifier-DFE-791531.csv',encoding='latin-1')
twitter.head()

In [None]:
twitter.shape

In [None]:
twitter.describe()

In [None]:
twitter.info()

In [None]:
twitter.columns

In [None]:
twitter.isnull().sum()

In [None]:
twitter['tweet_count'].value_counts()

In [None]:
twitter['retweet_count'].value_counts()

In [None]:
sns.barplot (x = 'gender', y = 'tweet_count',data = twitter)

In [None]:
sns.barplot (x = 'gender', y = 'retweet_count',data = twitter)

In [None]:
# Visualizing null values to get a better idea of the dataset & its trends

plt.subplots(figsize=(15,15))
sns.heatmap(twitter.isnull(), cbar=False)

## __Data Exploration & Feature Engineering__
Here we are going to explore the relationships of the independent and dependent variables, modify the features and look for anomalies to present a better dataset for the ML models.

Based on the observations above, we will keep only the following columns, which are required for implementing the ML algorithms:

   1. 'gender'
   2. 'link_color'
   3. 'sidebar_color'
   4. 'text'
   5. 'description'
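As a minimal sketch of the same reduction, the columns in the list above can also be selected by name instead of dropping everything else. The three-row frame `toy` and its values are hypothetical stand-ins for the real dataset; only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
toy = pd.DataFrame({
    "_unit_id": [1, 2, 3],
    "gender": ["male", "female", "brand"],
    "link_color": ["0084B4", "FF0000", "0084B4"],
    "sidebar_color": ["C0DEED", "FFFFFF", "C0DEED"],
    "text": ["hello world", "good morning", "buy now"],
    "description": ["dev", "writer", None],
    "tweet_count": [10, 20, 30],
})

# Columns named in the list above.
keep = ["gender", "link_color", "sidebar_color", "text", "description"]

# Selecting by name keeps only these columns, in this order.
reduced = toy[keep]
print(reduced.columns.tolist())
```

Selecting by name tends to be less fragile than a long `drop` list, because an unexpected extra column in the CSV cannot slip through.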


In [None]:
#Dropping irrelevant columns from dataset

twitter = twitter.drop(['_unit_id', '_golden', '_unit_state', '_last_judgment_at', 'gender:confidence', 'profile_yn', 'profile_yn:confidence', 
                        'created', 'fav_number', 'gender_gold', 'name', 'profile_yn_gold', 'profileimage', 'retweet_count', 
                        'tweet_coord', 'tweet_count', 'tweet_created', 'tweet_id', 'tweet_location', 'user_timezone', 
                        '_trusted_judgments'], axis = 1)

In [None]:
twitter.head()

In [None]:
twitter['gender'].count()

In [None]:
twitter['gender'].value_counts(dropna=False) 

In [None]:
sns.countplot(x='gender', data=twitter)

### __Text Analysis :__

In [None]:
# dropping all the null values from 'gender'

twitter = twitter.dropna(subset=['gender'],how ='any')  
twitter.head()

In [None]:
# Merging the 'text' & 'description' to combine all sorts of text and then find out common words

twitter['text_description'] = twitter['text'].str.cat(twitter['description'], sep=' ')

In [None]:
twitter = twitter.drop(['description','text'],axis=1)

In [None]:
twitter.head()

### Text cleaning
In this phase, we will filter the text and apply other steps such as normalizing and lemmatizing.
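One possible way to normalize a tweet is sketched below, under the assumption that URLs should be removed as whole tokens before punctuation is stripped, so that fragments such as "https" or "co" never have to be deleted from inside ordinary words. The function name `clean_text` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def clean_text(s):
    """Lowercase, drop URLs, then keep only ASCII letters and single spaces."""
    s = str(s).lower()
    s = re.sub(r"https?://\S+", " ", s)   # remove whole URLs first
    s = re.sub(r"[^a-z\s]", " ", s)       # drop digits, punctuation, symbols
    s = re.sub(r"\s+", " ", s)            # collapse runs of whitespace
    return s.strip()

print(clean_text("Coffee & code!! https://t.co/abc123 #100DaysOfCode"))
```

Note that this sketch also discards accented and other non-ASCII letters, which may or may not be desirable for this dataset.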

In [None]:
# Junk words & letters other than the English vocab words are filtered out
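import re

def cleaning(s):
    s = str(s)
    s = s.lower()
    s = s.replace(",", "")
    s = re.sub(r'[!@#$_]', '', s)    # drop a few common symbols
    s = re.sub(r'\W,\s', ' ', s)
    s = re.sub(r'[^\w]', ' ', s)     # any remaining non-word character becomes a space
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\d+', '', s)        # remove digits
    s = re.sub(r'\s+', ' ', s)       # collapse repeated whitespace
    # Caution: plain substring replacement, so "co" is also removed from
    # inside ordinary words (e.g. "color" becomes "lor")
    s = s.replace("co", "")
    s = s.replace("https", "")
    return s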

twitter['text_description'] = [cleaning(s) for s in twitter['text_description']]
twitter.head()

#### Removing __Stop words__ from 'text_description'

In [None]:
from collections import Counter
words = Counter()
for twit in twitter['text_description']:
    for x in twit.split(' '):
        words[x] += 1

words.most_common(20)

These are the most common words in the whole dataset, and most of them are stopwords.
Such words are considered 'noise' and can be eliminated.

In [None]:
# Filtering 'text_description' and printing the most commonly used words after eliminating stopwords

from nltk.corpus import stopwords

# Requires the NLTK stopword corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
words_filtered = Counter()
for x, y in words.items():
    if x not in stop_words:
        words_filtered[x] = y

words_filtered.most_common(20)

## <u>__Answer 1__ : The most used words by the users are words like Love, Like, Life, Time etc </u>
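The counts above pool every user together. To compare the words used by male and female users, one option is to keep one counter per `gender` value. The sketch below assumes cleaned text and uses hypothetical rows and a tiny illustrative stop list rather than the real data.

```python
from collections import Counter

# Hypothetical (gender, cleaned text) pairs standing in for the real rows.
rows = [
    ("male", "love the game tonight"),
    ("female", "love my life"),
    ("female", "love this time of year"),
    ("male", "game time"),
]
stop_words = {"the", "my", "this", "of"}  # tiny illustrative stop list

# One Counter per gender, filled with that gender's non-stopword tokens.
counts = {}
for gender, text in rows:
    counter = counts.setdefault(gender, Counter())
    counter.update(w for w in text.split() if w not in stop_words)

for gender, counter in counts.items():
    print(gender, counter.most_common(2))
```

With the real frame, the pairs could come from `zip(twitter['gender'], twitter['text_description'])`.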

There is still some trash to clear out such as HTML tags, emojis & unfinished words

In [None]:
# This will clear out the rest of the remaining junk

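import re

def preprocessor(text_description):
    # Strip HTML tags first, while the angle brackets are still present
    text_description = re.sub(r'<[^>]*>', '', text_description)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text_description)
    # Replace every non-letter character with a space (A-Z, not A-z,
    # which would also let through characters such as [ \ ] ^ _ and `)
    text_description = re.sub(r'[^a-zA-Z]+', ' ', text_description.lower())
    # Append the emoticons, without their noses, to the cleaned text
    return text_description + ' ' + ' '.join(emoticons).replace('-', '')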

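#### __Stemming__
To reduce our vocabulary and consolidate words to their roots, we can use __stemming__ or __lemmatization__.
Here we use the __Porter algorithm__, which is a stemmer: it strips suffixes by rule rather than looking words up in a dictionary as a lemmatizer would.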


In [None]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text_description):  # tokenizer to break our tweets down into individual words
    return text_description.split()

def tokenizer_porter(text_description):
    return [porter.stem(word) for word in text_description.split()]

In [None]:
twitter.text_description


### __Color attribute analysis :__

#### Side bar color >>>

In [None]:
male_sidebar_color = twitter[twitter['gender'] == 'male']['sidebar_color'].value_counts().head(7)
male_sidebar_color_idx = male_sidebar_color.index
male_top_color = male_sidebar_color_idx.values

male_top_color[2] = '000000'
print (male_top_color)

l = lambda x: '#'+x

sns.set_style("darkgrid")
sns.barplot (x = male_sidebar_color, y = male_top_color) 

In [None]:
female_sidebar_color = twitter[twitter['gender'] == 'female']['sidebar_color'].value_counts().head(7)
female_sidebar_color_idx = female_sidebar_color.index
female_top_color = female_sidebar_color_idx.values

female_top_color[2] = '000000'
print (female_top_color)

l = lambda x: '#'+x

sns.set_style("darkgrid")
sns.barplot (x = female_sidebar_color, y = female_top_color)

#### Link color >>>

In [None]:
male_link_color = twitter[twitter['gender'] == 'male']['link_color'].value_counts().head(7)
male_link_color_idx = male_link_color.index
male_top_color = male_link_color_idx.values
male_top_color[1] = '009999'
male_top_color[5] = '000000'
print(male_top_color)

l = lambda x: '#'+x

sns.set_style("whitegrid", {"axes.facecolor": "white"})
sns.barplot (x = male_link_color, y = male_link_color_idx)

In [None]:
female_link_color = twitter[twitter['gender'] == 'female']['link_color'].value_counts().head(7)
female_link_color_idx = female_link_color.index
female_top_color = female_link_color_idx.values

l = lambda x: '#'+x

sns.set_style("whitegrid", {"axes.facecolor": "white"})
sns.barplot (x = female_link_color, y = female_link_color_idx, palette=list(map(l, female_top_color)))

#### As seen in the plots above, most users have not changed the default colors of their profile; even if those users are discarded, a sizeable part of the dataset remains usable for classification.
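Discarding the default colors could look roughly like the sketch below. The hex codes `0084B4` and `C0DEED` are assumed here to be the default link and sidebar colors and should be checked against the `value_counts()` output above. The frame `toy` is hypothetical.

```python
import pandas as pd

# Assumed default hex codes; verify them against the value_counts() output.
DEFAULT_LINK = "0084B4"
DEFAULT_SIDEBAR = "C0DEED"

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "link_color": ["0084B4", "FF3366", "0084B4", "009999"],
    "sidebar_color": ["C0DEED", "FFFFFF", "C0DEED", "000000"],
})

# Keep only rows where both color attributes were changed by the user.
customised = toy[(toy["link_color"] != DEFAULT_LINK) &
                 (toy["sidebar_color"] != DEFAULT_SIDEBAR)]
print(len(customised), "of", len(toy), "rows use non-default colors")
```

Whether to require both colors to be changed, or only one, is a design choice. This sketch uses the stricter rule.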

## <u>__Answer 2__ : The most commonly used color for both 'sidebar_color' & 'link_color' is blue, followed by orange and then the rest. </u>

## __Training & Testing of ML algorithms__
The following classifiers have been chosen for training on the dataset :-

    1. Logistic Regression
    2. Random forest
    3. SVM Classifier

The ML algorithms are trained on each feature of the dataset and the algorithm with the maximum accuracy is the most optimal model for this dataset and the feature that gives maximum accuracy is the optimal feature for classification of this data.
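The comparison described above can be sketched as one loop that fits each classifier behind the same TF-IDF step. The helper name `compare_models` and the toy corpus are hypothetical. The sketch also uses the default `TfidfVectorizer` tokenizer rather than `tokenizer_porter`, so it runs without NLTK.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def compare_models(X_train, X_test, y_train, y_test):
    """Fit each classifier behind the same TF-IDF step; return accuracies."""
    models = {
        "Logistic Regression": LogisticRegression(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=40, random_state=0),
        "SVM": SVC(kernel="linear"),
    }
    scores = {}
    for name, model in models.items():
        clf = Pipeline([("vect", TfidfVectorizer()), ("clf", model)])
        clf.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, clf.predict(X_test))
    return scores

# Hypothetical toy corpus, far too small for meaningful accuracies.
train_text = ["love my cats", "football game tonight",
              "love this dress", "game day beer"]
train_y = ["female", "male", "female", "male"]
test_text = ["love cats", "football beer"]
test_y = ["female", "male"]

scores = compare_models(train_text, test_text, train_y, test_y)
print(max(scores, key=scores.get), scores)
```

Calling the same helper once per feature (`text_description`, `sidebar_color`, `link_color`) would keep the three comparisons on identical settings.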

## Training for Text :

In [None]:
# The frequency of the words will be helpful in classifying the gender of the users.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Setting up training and testing data 
encoder = LabelEncoder()
y = encoder.fit_transform(twitter['gender'])
X = twitter['text_description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

#### Modelling on Logistic Regression >>>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(multi_class='ovr', random_state=0))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

#### Modelling on Random Forest >>>

In [None]:
from sklearn.ensemble import RandomForestClassifier

n = range (1,100,10)

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', RandomForestClassifier(n_estimators = 40, random_state=0))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

#### Modelling on SVM >>>

In [None]:
from sklearn.svm import SVC

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', SVC(kernel = 'linear'))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

### __Experimental Results__

Accuracy:

    Logistic Regression: 57.90%
    Random Forest: 53.92%
    SVM: 57.85%

## <u>Winner: __Logistic Regression__ model</u>
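To put these accuracies in context, it may help to compare them with a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. The sketch below uses a hypothetical label list and a hypothetical helper name `majority_baseline`. In the notebook, `y_test` would take the place of `labels`.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(labels)

# Hypothetical label list; in the notebook this would be y_test.
labels = ["female"] * 5 + ["male"] * 4 + ["brand"] * 3
label, acc = majority_baseline(labels)
print(label, round(acc, 4))
```

A model is only informative to the extent that it beats this number on the same test split.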


## Training for color attributes :

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(twitter['gender'])
X = twitter['sidebar_color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

#### Modelling on Logistic Regression(sidebar_color) >>>

In [None]:
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(multi_class='ovr', random_state=0))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

#### Modelling for Random Forest(sidebar_color) >>>

In [None]:
from sklearn.ensemble import RandomForestClassifier

n = range (1,100,10)

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', RandomForestClassifier(n_estimators = 40, random_state=0))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

#### Modelling for SVM(sidebar_color) >>>

In [None]:
from sklearn.svm import SVC

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', SVC(kernel = 'linear'))])
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(twitter['gender'])
X = twitter['link_color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

#### Modelling on Logistic Regression(link_color) >>>

In [None]:
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(multi_class='ovr', random_state=0))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

#### Modelling for Random Forest(link_color) >>>

In [None]:
from sklearn.ensemble import RandomForestClassifier

n = range (1,100,10)

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', RandomForestClassifier(n_estimators = 40, random_state=0))])

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

#### Modelling for SVM(link_color) >>>

In [None]:
from sklearn.svm import SVC

tfidf = TfidfVectorizer(lowercase=False,
                        tokenizer=tokenizer_porter)
clf = Pipeline([('vect', tfidf),
                ('clf', SVC(kernel = 'linear'))])
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:',accuracy_score(y_test,predictions))

### __Experimental Results__

Accuracy for 'sidebar_color':

    Logistic Regression: 37.71%
    Random Forest: 37.62%
    SVM: 37.77%

## <u>Winner: __SVM__ model</u>

Accuracy for 'link_color':

    Logistic Regression: 40.34%
    Random Forest: 40.47%
    SVM: 40.36%

## <u>Winner: __Random Forest__ model</u>
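The gaps between the three models on the color attributes are at most 0.15 percentage points, so a single train/test split may not be enough to separate them. One rough check is the binomial standard error of an accuracy, sqrt(p(1-p)/n). In the sketch below the test-set size `n_test` is a hypothetical value, and `len(y_test)` would replace it in the notebook.

```python
import math

def accuracy_std_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# n_test is a hypothetical test-set size; use len(y_test) in the notebook.
n_test = 5000
for name, acc in [("Logistic Regression", 0.4034),
                  ("Random Forest", 0.4047),
                  ("SVM", 0.4036)]:
    print(f"{name}: {acc:.4f} +/- {accuracy_std_error(acc, n_test):.4f}")
```

If the standard error turns out to be larger than the gap between two models, the ranking between them should be read as a tie rather than a clear win.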


