# Introduction

Hello people, welcome to this kernel! In this kernel I am going to classify texts. Before starting, let's take a look at our schedule.

# Schedule
1. Importing Libraries and Data
1. Data Overview
1. Preprocessing
    * Dropping Unnamed Features
    * Converting Y Axis to Int64
    * Preparing Texts for Count Vectorizer (Bag of Words)
1. Bag of Words
1. Text Classification
    * Importing Classifiers
    * Train Test Split
    * Naive Bayes Classification
    * Random Forest Classification
    * Decision Tree Classification
    * Logistic Regression    
1. Result
1. Conclusion

# Importing Libraries and The Data

In this section I am going to import the libraries that I will use. However, I am not going to import the classification algorithms yet; I will import them when I need them.

In [None]:
"""
Data Manipulating
"""
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

"""
NLTK (Natural Language Tool Kit) | RE (Regular Expressions) | Count Vectorizer (SKLearn)
"""
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
data = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv',encoding="latin1")

# Data Overview

In this section I am going to examine the data. In order to do this I am going to use the head(), info() and isnull() methods.

In [None]:
data.head()

* There are irrelevant features in our dataset. We will drop them later.

In [None]:
data.info()

In [None]:
data.isnull().sum()

* As we can see, there are no missing values in the important features. The other features have missing values, but that does not matter because we will drop them.
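
To make the `isnull().sum()` output concrete, here is a minimal sketch that counts missing cells per column using only the standard library. The three sample rows and the `Unnamed: 2` column name are made up for illustration, and treating an empty cell as missing is an assumption of this sketch rather than a description of how pandas stores missing values.

```python
import csv
import io

# Hypothetical sample in the same shape as spam.csv (not real rows from the dataset)
sample = io.StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,See you at lunch,\n"
    "spam,Win a free prize now,\n"
    "ham,Call me later,extra text\n"
)

def count_missing(rows):
    """Count empty cells per column, similar in spirit to data.isnull().sum()."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value == "" else 0)
    return counts

missing = count_missing(csv.DictReader(sample))
print(missing)
```

In this toy sample only the extra column has empty cells, which mirrors the situation described above: the label and text columns are complete.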

# Preprocessing

In this section I am going to prepare the dataset for the count vectorizer. In order to do this I am going to use a for loop.

In [None]:
x = data.drop("v1",axis=1)
y = data.v1

* At the beginning I wanted to separate these two axes, because I am not going to apply the text preprocessing to the y axis.

And now I am going to drop the unnamed features.

In [None]:
# Keep only the message text column ("v2"); every other column is dropped
x = x[["v2"]]

x.info()

And now I am going to convert the y axis to int. In order to do this I am going to use a list comprehension: "ham" becomes 1 and everything else (that is, "spam") becomes 0.

In [None]:
y = [1 if each == "ham" else 0 for each in y]
y[:5]

In [None]:
lemma = nltk.WordNetLemmatizer()

* I've created a lemma object that helps us lemmatize words. In order to do this I've used the nltk library's WordNetLemmatizer.

In [None]:
new_x = []
pattern = "[^a-zA-Z]"
for txt in x["v2"]:
    
    txt = re.sub(pattern," ",txt) #Cleaning
    txt = txt.lower() # Lowering
    txt = nltk.word_tokenize(txt) #Tokenizing
    txt = [lemma.lemmatize(each) for each in txt] # Lemmatizing
    txt = " ".join(txt) # Joining
    new_x.append(txt) # Appending 
    

In [None]:
new_x[:5]

What did we do in this operation:
* We've used the `re` library's `sub` method to replace every character that is not a letter (a-z, A-Z) with a space.
* We've lowercased the texts, because in programming LIKE and like are different.
* We've split the texts into words.
* We've lemmatized each word in a text.
* And finally we've joined the words back into a single string.
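
The cleaning, lowercasing and joining steps can be tried out on a single message with only the standard library. This is a rough sketch, not the pipeline above: `str.split` stands in for `nltk.word_tokenize`, lemmatizing is left out, and `clean_text` and the sample message are made up for illustration.

```python
import re

# Same pattern as above: anything that is not a letter
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def clean_text(text):
    """Replace non-letters with spaces, lowercase, split into words and re-join."""
    text = NON_LETTERS.sub(" ", text)  # cleaning
    text = text.lower()                # lowercasing
    tokens = text.split()              # simple whitespace split (stand-in for word_tokenize)
    return " ".join(tokens)            # joining

cleaned = clean_text("WINNER!! Call 0800-123 now, it's FREE")
print(cleaned)  # winner call now it s free
```

Note how the apostrophe in "it's" is replaced by a space, so the word ends up as two tokens, "it" and "s".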


# Bag Of Words

In this section I am going to create the bag of words. In order to do this I am going to use the CountVectorizer class that I imported previously.

In [None]:
CV = CountVectorizer(stop_words='english')
sparce_matrix = CV.fit_transform(new_x).toarray()

What did we do in this operation:
* First, we've created a count vectorizer object.
* We've told it to drop English stop words.
* Finally, we've passed our cleaned texts (`new_x`) to the object with `fit_transform` and converted the result to an array.
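
To show what a bag of words looks like, here is a small pure-Python sketch that builds the same kind of document-by-word count matrix. It is not how `CountVectorizer` is implemented: the real class also applies its own tokenization and the stop word list, which this sketch skips. `bag_of_words` and the three documents are hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["free prize call now", "call me now", "free free prize"]  # hypothetical messages
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one message and each column is one word of the vocabulary, so most entries are zero once the vocabulary grows.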

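### What is a Sparse Matrix?
A sparse matrix is a matrix in which most entries are zero. The bag of words is such a matrix, because each message contains only a few of the words in the vocabulary. Note that `fit_transform` returns a SciPy sparse matrix and `.toarray()` converts it to a dense NumPy array, so the `sparce_matrix` variable above actually holds a dense array.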


### What are Stopwords?
Stopwords are words that are irrelevant for text classification, so we drop them. For instance, in English *and* and *or* are stopwords.
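
As a small illustration of stop word removal, the sketch below filters a sentence against a tiny hand-written set. The set `STOP_WORDS` is hypothetical and far shorter than the built-in English list that `CountVectorizer(stop_words='english')` uses.

```python
# Hypothetical, deliberately tiny stop word set (not scikit-learn's list)
STOP_WORDS = {"and", "or", "the", "a", "to"}

def remove_stop_words(text):
    """Drop every word that appears in STOP_WORDS and re-join the rest."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

filtered = remove_stop_words("call the number and claim a prize")
print(filtered)  # call number claim prize
```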

In [None]:
x = sparce_matrix

# Text Classification
In this section I am going to use our bag-of-words matrix for classification. In order to do this I am going to use different algorithms from the SKLearn library.

In this kernel I am going to use these classification algorithms:
* Naive Bayes Classification
* Random Forest Classification
* Decision Tree Classification
* Logistic Regression    

At the beginning of this section I am going to import the libraries that I need.

## Importing Algorithms

In [None]:
from sklearn.model_selection import train_test_split #Splitter
from sklearn.naive_bayes import GaussianNB # Naive Bayes
from sklearn.ensemble import RandomForestClassifier # Random Forest
from sklearn.tree import DecisionTreeClassifier # Decision Tree
from sklearn.linear_model import LogisticRegression # Logistic Regression


## Train Test Splitting

In this section I am going to split the data into train and test sets. In order to do this I am going to use the train_test_split function from the sklearn library.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)
print("x_train len is ",len(x_train))
print("x_test len is",len(x_test))
print("y_train len is",len(y_train))
print("y_test len is",len(y_test))
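
For intuition about what an 80/20 split does, here is a minimal standard-library sketch. It is not scikit-learn's implementation and will not reproduce the same shuffle as `random_state=1`; `simple_train_test_split` and the toy data are made up, and rounding the test size up is a choice of this sketch.

```python
import math
import random

def simple_train_test_split(features, labels, test_size=0.2, seed=1):
    """Shuffle the row indices once, then cut them into a test part and a train part."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)           # seeded, so the split is repeatable
    n_test = math.ceil(len(indices) * test_size)   # size of the test set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [features[i] for i in train_idx]
    test_x = [features[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

toy_x = [f"message {i}" for i in range(10)]  # hypothetical features
toy_y = [i % 2 for i in range(10)]           # hypothetical labels
tr_x, te_x, tr_y, te_y = simple_train_test_split(toy_x, toy_y)
print(len(tr_x), len(te_x))  # 8 2
```

The important property is that features and labels are indexed with the same shuffled indices, so every message keeps its own label.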

## Naive Bayes Classification

In this sub-section I am going to create a Naive Bayes classifier.

In [None]:
gnb = GaussianNB()
gnb.fit(x_train,y_train)
gnb.score(x_test,y_test)

Our score is about 87% for Naive Bayes. It's great!
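
The `score` method of a scikit-learn classifier returns the mean accuracy, that is, the share of test samples whose predicted label equals the true label. A minimal sketch of that calculation, with a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_true = [1, 1, 0, 1, 0]  # hypothetical true labels
toy_pred = [1, 0, 0, 1, 1]  # hypothetical predictions
toy_score = accuracy(toy_true, toy_pred)
print(toy_score)  # 0.6
```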

## Random Forest Classification
In this section I am going to train a RFC model.

In [None]:
rfc = RandomForestClassifier(n_estimators=50,random_state=1)
rfc.fit(x_train,y_train)
rfc.score(x_test,y_test)

Our Random Forest Classification result is about 98%. It is better than NB.

## Decision Tree Classification
In this section I am going to train a DTC model.

In [None]:
dtc = DecisionTreeClassifier(random_state=1)
dtc.fit(x_train,y_train)
dtc.score(x_test,y_test)

Our score is very similar to the RFC score.

## Logistic Regression Classification

In this section I am going to train a LogReg model.

In [None]:
logreg = LogisticRegression(random_state=1)
logreg.fit(x_train,y_train)
logreg.score(x_test,y_test)

And our best score came from logistic regression. 

# Result
Let's take a look at our scores.

* Naive Bayes: 87%
* Random Forest: 98.3%
* Decision Tree: 98%
* Logistic Regression: 98.6%

So, we can use Logistic Regression and the Random Forest Classifier for spam text classification.

# Conclusion

Thanks for your attention. If you upvote this kernel, I would be glad.

And if you have any questions, feel free to ask. I will answer as many as I can.
