Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its effort to bridge the gap between human communication and computer understanding.
* In this notebook, we will learn the basics of NLP using the Twitter User Gender Classification dataset.
* We will classify the dataset using the Naive Bayes algorithm.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Import Twitter Data

In [None]:
data = pd.read_csv(r"/kaggle/input/twitter-user-gender-classification/gender-classifier-DFE-791531.csv",encoding = "latin1")
data.head()

In [None]:
data.columns

In [None]:
data.info()

Our aim is to classify gender from text, so we only need the `gender` and `description` columns.

In [None]:
data = pd.concat([data.gender,data.description],axis=1) # New data contains just two columns
data.dropna(axis = 0,inplace = True) # Drop NaN values
data.gender = [1 if each == "female" else 0 for each in data.gender] # 1 for female, 0 for any other value (not only male)
data.gender.value_counts()

In [None]:
data.head()

## Cleaning Data 

### Regular Expression:
* A regular expression (RegEx) is a sequence of characters that forms a search pattern.
* RegEx can be used to check if a string contains the specified search pattern.
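
As a small illustration of the two points above, the following sketch checks whether a string contains a pattern and then replaces every match. The sample string is made up for this example and is not taken from the dataset.

```python
import re

text = "Hello, World! 123 #NLP"  # made-up sample string

# re.search returns a match object if the pattern occurs anywhere in the string, otherwise None
has_digits = re.search(r"[0-9]+", text) is not None

# re.sub replaces every match; here each character that is not a letter becomes a space
letters_only = re.sub(r"[^a-zA-Z]", " ", text)

print(has_digits)
print(letters_only.split())
```

Replacing with a space rather than an empty string keeps neighbouring words separated, which matters for the tokenization step later on.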

First, I will show the whole process on a single tweet. Then it will be applied to all tweets in the dataset.

In [None]:
first_description = data.description[4] 
first_description

In [None]:
import re
description = re.sub("[^a-zA-Z]"," ",first_description)  # Replace every character that is not a letter (a-z, A-Z) with a space
description = description.lower()   # Convert all words to lowercase
description

### Stopwords (Irrelevant Words)
* In computing, stop words are words that are filtered out before or after natural language data (text) is processed. “Stop words” usually refers to the most common words in a language, but there is no single universal list of stop words used by all natural language processing tools.
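
The idea can be sketched without any library. The stop-word set below is a tiny hand-picked list assumed for illustration only; it is not NLTK's English list, and `remove_stop_words` is a hypothetical helper name.

```python
# A tiny hand-picked stop-word set, for illustration only (not NLTK's list)
toy_stop_words = {"the", "is", "a", "of", "and", "to", "in"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = "the cat is in the garden and the dog is asleep".split()
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)  # ['cat', 'garden', 'dog', 'asleep']
```

Storing the stop words in a set makes each membership check fast, which is why the list is usually converted to a set once before filtering.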

In [None]:
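import nltk # Natural Language Toolkit
nltk.download("stopwords")
nltk.download("punkt")  # tokenizer models needed by word_tokenize (newer NLTK releases may ask for "punkt_tab")
from nltk.corpus import stopwords
description = nltk.word_tokenize(description) # Split the text into words
stop_words = set(stopwords.words("english")) # Build the set once instead of once per word
description = [word for word in description if word not in stop_words]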

In [None]:
description

### Lemmatization
* For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

* The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

    * am, are, is $\Rightarrow$ be
    * car, cars, car's, cars' $\Rightarrow$ car
* The result of this mapping of text will be something like:
    * the boy's cars are different colors $\Rightarrow$ the boy car be differ color
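
The mapping above can be mimicked with a toy lookup table. This is only a sketch: `toy_lemmas` and `toy_lemmatize` are hypothetical names, and the table covers just the forms listed above, whereas a real lemmatizer relies on a full dictionary plus morphological rules.

```python
# Hypothetical toy lookup table: maps inflected forms to a common base form
toy_lemmas = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def toy_lemmatize(tokens, lemmas):
    """Replace each token by its base form when the table knows it; otherwise keep it."""
    return [lemmas.get(token, token) for token in tokens]

print(toy_lemmatize(["cars", "are", "fast"], toy_lemmas))  # ['car', 'be', 'fast']
```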

In [None]:
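import nltk as nlp
nlp.download("wordnet")  # lexical database required by WordNetLemmatizer

lemma = nlp.WordNetLemmatizer()
description = [lemma.lemmatize(word) for word in description]  # Reduce each word to its base form

description = " ".join(description)  # Join the words back into a single string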

In [None]:
description

Let's apply these steps to all tweets with a for loop.

In [None]:
description_list = []
stop_words = set(stopwords.words("english"))  # Build the stop-word set once, outside the loop
lemma = nlp.WordNetLemmatizer()               # Create the lemmatizer once as well
for description in data.description:
    description = re.sub("[^a-zA-Z]"," ",description)  # Keep letters only
    description = description.lower()
    description = nltk.word_tokenize(description)
    description = [word for word in description if word not in stop_words]
    description = [lemma.lemmatize(word) for word in description]
    description = " ".join(description)
    description_list.append(description)

### Bag of Words
* A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
* The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
* A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

    1. A vocabulary of known words.
    1. A measure of the presence of known words.
    

* It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
* A very common feature extraction procedure for sentences and documents is the bag-of-words (BoW) approach. In this approach, we look at the histogram of the words within the text, i.e. we treat each word count as a feature.
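
The two ingredients listed above (a vocabulary and a measure of presence) can be sketched in plain Python before turning to a library. The two documents are made up for this example, and `to_count_vector` is a hypothetical helper name.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # made-up example documents

# 1. Vocabulary of known words: every distinct word, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# 2. Measure of presence: how often each vocabulary word occurs in a document
def to_count_vector(doc, vocabulary):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row has one entry per vocabulary word, so word order inside the document no longer plays any role, which is exactly the "bag" property described above.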

In [None]:
from sklearn.feature_extraction.text import CountVectorizer # for bag of words
max_features = 5000
count_vectorizer = CountVectorizer(max_features=max_features,stop_words = "english")
sparce_matrix = count_vectorizer.fit_transform(description_list).toarray()  # x
print("The {} most common words are: {}".format(max_features,count_vectorizer.get_feature_names_out()))  # get_feature_names() was removed in scikit-learn 1.2

### Applying Our Machine Learning Model

In [None]:
y = data.iloc[:,0].values   # male or female classes (output)
x = sparce_matrix # our input

import seaborn as sns
import matplotlib.pyplot as plt
# visualize the number of samples in each gender class
plt.figure(figsize=(15,7))
sns.countplot(x=y)
plt.title("Number of Samples per Gender Class")

In [None]:
# train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.1, random_state = 42)


# naive bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)

# prediction
y_pred = nb.predict(x_test)

print("Accuracy: ",nb.score(x_test,y_test))  # score expects the test features and the true labels