# Introduction

Hello everyone, and welcome to my kernel! In this kernel I am going to classify gender using Twitter bio (description) data. To do this, I am going to use Natural Language Processing.

Let's take a look at our schedule.

# Schedule
1. Importing Libraries and Data
1. Preprocessing Data
1. Classification Algorithms
    * Naive Bayes Classification
    * Logistic Regression Classification
    * Random Forest Classification
1. Result
1. Conclusion

# Importing Libraries and Data

In this section I am going to import the libraries and the data that I will use in this kernel. I am not going to import the machine learning libraries yet; I will import them where I use them.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt

import warnings as wrn
wrn.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/twitter-user-gender-classification/gender-classifier-DFE-791531.csv',encoding='latin1')

# Preprocessing

In this section I am going to prepare the data for machine learning by following these steps:

* Dropping Redundant Features
* Dropping NaN Values
* Converting Gender Feature to Int64
* Cleaning Data using Regular Expressions (RE Library)
* Converting Text to Lowercase
* Converting String to List
* Lemmatizing
* Converting List to String
* Creating Bag of Words
* Train Test Splitting

I am going to start by dropping the redundant features.

In [None]:
data = data.loc[:,["gender","description"]]


In [None]:
data.head()

In [None]:
data.dropna(inplace=True)

Now I am going to convert the gender feature to Int64.

In [None]:
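# Encode gender as an integer: 0 for "male", 1 for every other value.
# Note: any label other than "male" or "female" left in the column is also mapped to 1.
gnd = [0 if each == "male" else 1 for each in data.gender]
data.gender = gnd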
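
The encoding above sends every value other than "male" to 1. As a minimal sketch, assuming the gender column may also hold labels other than male and female, the rows could be restricted to those two labels first and then mapped explicitly. The toy DataFrame and its "other" label are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; "other" stands in for any label that is not male or female.
toy = pd.DataFrame({
    "gender": ["male", "female", "other", "female"],
    "description": ["bio a", "bio b", "bio c", "bio d"],
})

# Keep only the rows whose label is one of the two target classes.
toy = toy[toy["gender"].isin(["male", "female"])].copy()

# Map the labels to integers explicitly instead of relying on an else branch.
toy["gender"] = toy["gender"].map({"male": 0, "female": 1})

print(toy["gender"].tolist())
```

With this approach a row with an unexpected label is dropped instead of being counted as class 1.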

Now I am going to create a for loop that applies the remaining steps to each row. Before that, I am going to import the libraries that I need for preprocessing.

In [None]:
import nltk # Natural Language Tool Kit
import re # Regular Expression

lemma = nltk.WordNetLemmatizer() # Lemmatizer (nltk library)
pattern = "[^a-zA-Z]"


In [None]:
desc_list = []
for each in data.description:
    each = re.sub(pattern," ",each) # Cleaning: replace non-letters with spaces
    each = each.lower() # Converting to lowercase
    each = nltk.word_tokenize(each) # Converting string to list
    each = [lemma.lemmatize(word) for word in each] # Lemmatizing each word
    each = " ".join(each) # Converting list to string
    desc_list.append(each)

In [None]:
desc_list[:5]
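
To see what the loop does to a single bio, here is a minimal sketch of the cleaning, lowercasing, and splitting steps on a made-up string. It uses str.split instead of nltk.word_tokenize and skips lemmatization so that it runs without any NLTK data; both are simplifications for illustration, and clean_bio is a hypothetical helper name.

```python
import re

pattern = "[^a-zA-Z]"  # same pattern as above: anything that is not an ASCII letter

def clean_bio(text):
    """Clean one bio: strip non-letters, lowercase, split into words, re-join."""
    text = re.sub(pattern, " ", text)  # replace non-letters with spaces
    text = text.lower()                # convert to lowercase
    words = text.split()               # simplified tokenization (whitespace split)
    return " ".join(words)             # back to a single string

print(clean_bio("Coffee lover & 2x Dad!! #Python"))
```

Digits and punctuation disappear, which is why a stray "x" survives from "2x" in this example.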

Now I am going to create the bag of words. To do this, I am going to import CountVectorizer from the scikit-learn library.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

most_used = 5000 # Keep only the 5000 most frequent words in the bios
cv = CountVectorizer(max_features=most_used,stop_words='english') 

In [None]:
sparce_matrix = cv.fit_transform(desc_list).toarray()
sparce_matrix
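
As a small illustration of what CountVectorizer produces, the sketch below builds a bag of words from three made-up bios. It keeps the result as a SciPy sparse matrix rather than calling toarray(), which is one way to reduce memory use; whether that matters for this dataset is an assumption, not something measured here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up bios, used only to show the shape of the output.
toy_bios = [
    "coffee lover and proud dad",
    "python developer and coffee addict",
    "proud mom of two",
]

toy_cv = CountVectorizer(stop_words="english")
toy_counts = toy_cv.fit_transform(toy_bios)  # sparse matrix, one row per bio

print(toy_counts.shape)             # (number of bios, vocabulary size)
print(sorted(toy_cv.vocabulary_))   # words kept after removing English stop words
```

Each column corresponds to one word in the vocabulary, and each cell holds how many times that word appears in the bio.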

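Our bag of words is the sparce_matrix array. Despite its name, it is a dense NumPy array, because toarray() converts the sparse result of fit_transform. It is our feature matrix (x) in machine learning, and we are going to use it in classification. Now I am going to split the arrays into two pieces, train and test.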

In [None]:
from sklearn.model_selection import train_test_split

x = sparce_matrix
y = data.gender.values

x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=1,test_size=0.1)
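
The split above is purely random. As a hedged sketch, the stratify argument of train_test_split can keep the class proportions the same in both parts; whether it changes the scores in this kernel is untested. The toy arrays below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 3 features, two balanced classes.
toy_x = np.arange(60).reshape(20, 3)
toy_y = np.array([0] * 10 + [1] * 10)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.1, random_state=1, stratify=toy_y
)

print(tx_train.shape, tx_test.shape)
print(sorted(ty_test.tolist()))  # one sample from each class in the test part
```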

# Classification Algorithms 

Our train and test splits are ready. In this section I am going to train several classification algorithms, and at the end I am going to determine the best one for this dataset.

I am going to train these classification algorithms:
* Naive Bayes Classification
* Logistic Regression Classification
* Random Forest Classification

## Naive Bayes Classification Algorithm

In this section I am going to train a Naive Bayes classification model. To do this, I am going to use the scikit-learn library.

In [None]:
from sklearn.naive_bayes import GaussianNB
NBC = GaussianNB()
NBC.fit(x_train,y_train)
print(NBC.score(x_test,y_test))

* Our Naive Bayes classification score is low. GaussianNB assumes continuous, normally distributed features, which is a poor fit for word counts, so this variant of Naive Bayes is not well suited to this dataset.
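
For count features such as a bag of words, scikit-learn also provides MultinomialNB. The sketch below shows it on made-up texts with arbitrary labels; it is only an illustration and says nothing about how it would score on this dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts; the labels 0 and 1 are arbitrary classes here.
toy_docs = [
    "football games weekend",
    "football scores weekend",
    "recipe baking kitchen",
    "baking kitchen dinner",
]
toy_labels = [0, 0, 1, 1]

toy_vec = CountVectorizer()
toy_features = toy_vec.fit_transform(toy_docs)  # sparse word counts

toy_nb = MultinomialNB()
toy_nb.fit(toy_features, toy_labels)

# New texts must go through the same fitted vectorizer before prediction.
toy_pred = toy_nb.predict(toy_vec.transform(["football weekend", "baking kitchen"]))
print(toy_pred)
```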

## Logistic Regression Classification

In this section I am going to train a Logistic Regression model using the scikit-learn library.

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(x_train,y_train)
print(LR.score(x_test,y_test))

* Our Logistic Regression model's score is much better than the Naive Bayes score.
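
LogisticRegression also accepts sparse input, and its max_iter parameter can be raised if the solver stops before converging. The sketch below chains CountVectorizer and LogisticRegression with make_pipeline on made-up texts; max_iter=1000 is an arbitrary choice, not a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training texts; the labels 0 and 1 are arbitrary classes here.
lr_docs = [
    "football games weekend",
    "football scores weekend",
    "recipe baking kitchen",
    "baking kitchen dinner",
]
lr_labels = [0, 0, 1, 1]

# The pipeline vectorizes the text and passes the sparse counts straight to the model.
lr_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
lr_pipe.fit(lr_docs, lr_labels)

lr_pred = lr_pipe.predict(["football weekend", "baking kitchen"])
print(lr_pred)
```

A pipeline keeps the vectorizer and the model together, so new text can be passed to predict directly.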

## Random Forest Classification 

In this section I am going to train a Random Forest Classification (RFC) model. To do this, I am going to use the scikit-learn library.

In [None]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=20,random_state=1)

RFC.fit(x_train,y_train)
print(RFC.score(x_test,y_test))

# Result

As we can see, the best classification algorithm for this dataset is Logistic Regression. Random Forest Classification might have done better, but I could not increase the n_estimators value because training took too long.
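
One option for the training time is the n_jobs parameter of RandomForestClassifier, which builds the trees in parallel. A minimal sketch on synthetic data from make_classification follows; the data and n_estimators=100 are placeholders, and any speed-up on the real bag of words is an assumption rather than a measured result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Twitter dataset.
rf_x, rf_y = make_classification(n_samples=200, n_features=20, random_state=1)
rf_x_train, rf_x_test, rf_y_train, rf_y_test = train_test_split(
    rf_x, rf_y, test_size=0.1, random_state=1
)

# n_jobs=-1 asks scikit-learn to use all available CPU cores.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=1)
rf.fit(rf_x_train, rf_y_train)

rf_score = rf.score(rf_x_test, rf_y_test)
print(rf_score)
```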


# Conclusion

Thanks for your attention. If there are any mistakes, or if you want to ask a question about anything, please contact me in the comment section. If you upvote this kernel, I would be glad.