In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # visualize data

from sklearn.pipeline import Pipeline

<h1>Introduction</h1>
This notebook presents an approach to a text-classification problem using machine learning techniques. The solution consists of three steps: data understanding, text pre-processing and classification.<br>

The goal is to train a model on the cleaned text dataset so that it can predict a user's personality type.

In [None]:
## Spoiler Alert!
## File Configurations
data_reduction = False
skip_early_classification = True
skip_initial_posts_classification = True
skip_first_cleaning = True
skip_clean_posts_classification = True
skip_second_clean_posts_classification = True
skip_lemma_posts_classification = True
skip_lemmatazation = True

save_data_to_files = False


# TPOT configurations
generations = 5
population_size = 5
config_dict='TPOT sparse'
verbosity = 1
memory='auto'

<h2><a id = Test></a>Data</h2>

From the dataset's description we know the following:
<ul>
    <li>This data was collected through the <a href="http://personalitycafe.com/">PersonalityCafe</a> forum, as it provides a large selection of people and their MBTI personality type, as well as what they have written.</li>
    <li>The dataset consists of:
        <ul>
        <li> 2 categorical columns (<code>type</code> and <code>posts</code>) and</li>  
        <li> 8675 rows, each row represents a user.</li>
        </ul>
    </li>  
</ul>
So, each user has a personality <code>type</code> and has posted some <code>posts</code>.<br>

In [None]:
dataset=pd.read_csv("/kaggle/input/mbti-type/mbti_1.csv")
dataset.head(2)

In [None]:
dataset.describe()

Let's use a module called <code>pandas-profiling</code> to extract some basic insights.

In [None]:
from pandas_profiling import ProfileReport

In [None]:
profile = ProfileReport(dataset)
profile.to_notebook_iframe()

We have the following insights:
<ul>
    <li> The <code>type</code> column is our target; it has 16 unique values (classes).</li>
    <li> There are minority and majority classes, so the dataset is imbalanced.</li>
    <li> There are no missing values.</li>
    <li> There are links that we have to process.</li>
    <li> Posts within each row are separated by <code>|||</code>.</li>
    <li> Some rows may contain non-Latin words.</li>
    
</ul>

At the end of the day, we have to solve a classification problem. I want to know what accuracy a model reaches when it is trained on the raw texts. We will then do some text cleaning and compare the accuracy of the new models with this baseline.<br>
It may be too early for this, but let's see how a simple model performs.<br>
The features are derived from the text of the <code>posts</code> column, and the target is the <code>type</code> column.<br>
Models can only be trained on numeric values, so we have to convert both the posts and the types into numbers. We will transform them in different ways.

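Because of the class imbalance and the CPU runtime limit, some instances of the majority classes can be removed: when <code>data_reduction</code> is set to <code>True</code>, at most 100 instances are kept per class. With <code>data_reduction = False</code>, the full dataset is used.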

In [None]:
if data_reduction:
    # Keep at most `data_len` rows per personality type.
    data_len = 100
    data = pd.concat([dataset.loc[dataset['type'] == personality][:data_len]
                      for personality in dataset["type"].unique()])
else:
    # Work on a copy so that new columns are not added to `dataset` as well.
    data = dataset.copy()

In [None]:
profile = ProfileReport(data)
profile.to_notebook_iframe()

With the reduction enabled the dataset is more balanced, and we can continue.

<h3>Text column's transformation.</h3>
There are many ways to do this. I chose <code>TfidfVectorizer</code> from scikit-learn because it is fast and tends to improve model accuracy.
<code>TfidfVectorizer</code> turns words into numbers: for each person (row) it produces a row of weights that reflect how often each word is used, scaled down for words that are common across the whole dataset. The result is a document-term matrix.
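
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is scikit-learn's documented default. The corpus and the helper name <code>tfidf_rows</code> are made up for illustration; the real <code>TfidfVectorizer</code> also handles tokenisation, stop words and sparse storage.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each string plays the role of one user's posts.
docs = ["music travel", "music books", "music music chess"]

def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[word] * idf[word] for word in vocabulary]
        norm = math.sqrt(sum(value * value for value in raw))
        rows.append([value / norm for value in raw])
    return vocabulary, rows

vocabulary, rows = tfidf_rows(docs)
print(vocabulary)
for doc, row in zip(docs, rows):
    print(doc, [round(value, 2) for value in row])
```

In the first toy document, "travel" gets a larger weight than "music" because "music" occurs in every document and is therefore less informative.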

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(data['posts'])
print("The document-term matrix has {} rows (documents) and {} columns (total words).".format(X.shape[0],X.shape[1]))

This is what <code>TfidfVectorizer</code> returned:

In [None]:
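terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)  # terms in column order; also works where get_feature_names() is unavailable
pd.DataFrame(X.toarray(),index=data.index,columns=terms)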

That's a lot of columns. Let's drop the words that appear in fewer than 2 documents (rows) of the dataset.
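
As a rough illustration of what a document-frequency threshold such as <code>min_df=2</code> does, the sketch below counts in how many toy documents each word occurs and keeps only the words that reach the threshold. The corpus is hypothetical and the filtering is written by hand; it is not scikit-learn's implementation.

```python
from collections import Counter

# Toy corpus (hypothetical): three "users", one string of posts each.
docs = ["music travel", "music books", "music music chess travel"]

# Document frequency counts each word once per document, however often it repeats.
document_frequency = Counter(word for doc in docs for word in set(doc.split()))

min_df = 2
kept = sorted(word for word, df in document_frequency.items() if df >= min_df)
dropped = sorted(word for word, df in document_frequency.items() if df < min_df)

print("kept:", kept)
print("dropped:", dropped)
```

Note that the threshold counts documents, not occurrences: a word repeated many times by a single user would still be dropped.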

In [None]:
tfidf2 = TfidfVectorizer(stop_words='english',min_df=2)
tfidf2.fit(data['posts'])

print("The document-term matrix has {} rows (documents) and {} columns (total words).".format(data.shape[0],len(tfidf2.vocabulary_)))

<h3>Target column transformation</h3>
We will encode the personality <code>type</code> column as integers in a new column, <code>num_type</code>.
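
The idea behind label encoding can be sketched in a few lines of plain Python: sort the distinct classes and replace each label with its position in that sorted list. The sample labels below are hypothetical, and this is a hand-written illustration rather than scikit-learn's code.

```python
# Toy labels (hypothetical sample of MBTI types).
types = ["INFJ", "ENTP", "INTP", "INFJ", "ENTP"]

# Sorted unique classes; each class is encoded as its position in this list.
classes = sorted(set(types))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in types]
decoded = [classes[code] for code in encoded]  # the mapping is reversible

print(classes)
print(encoded)
```

The integers are only identifiers: their order follows the alphabet, not any property of the personality types.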

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
type_le = LabelEncoder()
data['num_type']=type_le.fit_transform(data['type'])
y=data['num_type']

<h1>First classification try.</h1>

We will try to measure the accuracy of some models' predictions. To do this, we will use the TPOT module.

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. More info: http://epistasislab.github.io/tpot/ <br>
<code>TPOTClassifier</code> is what we need. At the moment we do not have to find the best model; a simple accuracy number is enough.
In addition, the dataset is very large and limited RAM is an issue.
So <code>TPOTClassifier</code> with the <code>TPOT sparse</code> configuration suits us best. <code>TPOT sparse</code> is a set of preprocessors and estimators that run fast on sparse matrices.


These are the classifiers and transformers that <code>TPOTClassifier</code> can use in sparse mode.

In [None]:
from tpot import TPOTClassifier

from tpot.config.classifier_sparse import  classifier_config_sparse
list(classifier_config_sparse)  # names of the available estimators and transformers


To split the dataset we will use <code>train_test_split</code>, keeping 10% of the rows for testing.

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectFwe
from sklearn.naive_bayes import MultinomialNB

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1, random_state=42,shuffle=True)

  
if skip_early_classification :
    tpot = Pipeline([('selectfwe', SelectFwe(alpha=0.006)), ('multinomialnb', MultinomialNB(alpha=0.01, fit_prior=False))])
    tpot.fit(X_train ,y_train)
    steps = tpot.steps
else:
    tpot = TPOTClassifier(generations=generations,verbosity=verbosity,population_size=population_size,
                      config_dict=config_dict,n_jobs=-1,random_state=42)
    tpot.fit(X_train ,y_train)
    steps = tpot.fitted_pipeline_.steps

pure_dataset_score  = tpot.score(X_test,y_test)*100
print("First classification's accuracy is {:.2f} % and model's steps are: {}".format(pure_dataset_score,steps))

<h3>Now we can take a closer look at our dataset.</h3>

The <code>posts</code> column contains the posts of each person. Each cell is made up of posts separated by <code>|||</code>. How many posts does each row contain?

In [None]:
data['#_posts'] = data['posts'].apply(lambda x : len(x.split("|||")))
display(data['#_posts'].describe())

In [None]:
print("There are {} rows with less than 50 posts. This is the {:.2f} % of the dataset.\n".format( len(data[data['#_posts']<50]),len(data[data['#_posts']<50])/data.shape[0]*100 ) )
print("There are {} rows with less than 40 posts. This is the {:.2f} % of the dataset.\n".format( len(data[data['#_posts']<40]),len(data[data['#_posts']<40])/data.shape[0]*100 ) )
data[data['#_posts']<50]['#_posts'].sort_values().plot(kind='bar',title="Less than 50",figsize=(15,4)) ; plt.show()

print("\nThere are {} rows with more than 50 posts. This is the {:.2f} % of the dataset.\n".format( len(data[data['#_posts']>50]),len(data[data['#_posts']>50])/data.shape[0]*100 ) )

In [None]:
# Number of rows (users) for each post count.
data.groupby('#_posts').size().plot(kind='bar',title="#_posts variation",figsize=(15,4)) ; plt.show()

What lies below 50?

In [None]:
tmp = data[data['#_posts']<50].groupby('type').count()[['#_posts']].join( data.groupby('type').count()[['#_posts']],lsuffix="<50",rsuffix='_total' )
tmp['Contain (%)'] = round(tmp['#_posts<50'] / tmp['#_posts_total'] *100)
tmp.sort_values('Contain (%)',ascending=False)

From the above we see that rows with fewer than 50 posts make up more than 10% of most classes, so we cannot treat them as noise and remove them.
<br>Still, most of the rows contain 50 posts.<br>

Let's see if there is any **correlation** between the number of posts and the personality type.

In [None]:
data.groupby('type')['#_posts'].agg(['median','mean'])  # median and mean number of posts per type

In [None]:
np.corrcoef(data['#_posts'],data['num_type'])[0][1]

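We cannot conclude that there is any correlation. Note that <code>num_type</code> is an arbitrary integer encoding of a categorical variable, so this coefficient is only a rough check.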
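
One caveat is worth illustrating: a Pearson coefficient computed against integer-encoded labels depends on which integer each label happens to receive. The sketch below uses hypothetical toy data and a hand-written <code>pearson</code> helper (not part of the notebook) to show two equally valid encodings of the same labels producing coefficients with opposite signs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): post counts and type labels for six users.
n_posts = [50, 50, 40, 30, 50, 20]
labels = ["A", "A", "B", "B", "C", "C"]

# Two equally valid integer encodings of the same three labels.
coding_1 = {"A": 0, "B": 1, "C": 2}
coding_2 = {"A": 2, "B": 0, "C": 1}

r_1 = pearson(n_posts, [coding_1[label] for label in labels])
r_2 = pearson(n_posts, [coding_2[label] for label in labels])
print(round(r_1, 3), round(r_2, 3))
```

For this reason the per-type median and mean shown earlier are a more direct way to compare the classes than a single coefficient.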

In [None]:
data['posts']

Before classification, we should remove content such as URLs, apostrophes in contractions, and numbers. Also, capital letters will be converted to lowercase, and where a character is repeated more than twice in a row we will keep only the first two occurrences (e.g. "aaaaaand" becomes "aand").
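
A minimal sketch of these rules with the standard <code>re</code> module is shown below. The function name <code>quick_clean</code> and the sample sentence are hypothetical, and the sketch is deliberately simpler than the full cleaning function of this notebook (it skips stop words and punctuation, for example).

```python
import re

def quick_clean(text):
    """Lowercase, drop URLs and numbers, remove apostrophes, squeeze repeated characters."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)          # remove URLs
    text = re.sub(r"\d+", " ", text)              # remove numbers
    text = text.replace("'", "")                  # "don't" -> "dont"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # "aaaaaand" -> "aand"
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(quick_clean("Aaaaaand I don't know, see https://example.com 42 times"))
```

The backreference <code>\1</code> in the squeeze rule matches whatever character was captured, so the same pattern handles any repeated letter.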

<h1> Text Cleaning</h1>

Again, we check how many words our dataset contains.

In [None]:
voc = tfidf.vocabulary_
print("Basic post's vocabulary contains {} words.".format(len(voc)))

We import the libraries needed for text processing. <br>
<code>re</code> stands for Regular Expressions; it helps to find specific character patterns in texts and replace them with new ones. More info <a href="https://docs.python.org/3/library/re.html">here</a>.<br>
<code>nltk</code> stands for Natural Language Toolkit. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. More info <a href="https://www.nltk.org/">here</a>. Note that the cleaning code below relies on spaCy's English tokenizer and stop-word list rather than on NLTK.

In [None]:
import re

import string
punctuation = string.punctuation

In [None]:
from spacy.lang.en import English
nlp = English()
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [None]:
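# Regex substitutions applied in order, after stop-word and punctuation removal.
givenpatterns =[(r'(\|\|\|)',r' ||| '),      # pad the post separator with spaces
                (r'http\S+', r''),            # drop any remaining URLs
                (r"(.)\1{2,}",r"\1\1"),       # squeeze 3+ repeated characters down to 2
                (r'[0-9]\S+',r''),            # drop a digit together with the non-space characters after it
                (r"'",r''),                   # drop apostrophes
                (r' {2,}',r' ')]              # collapse repeated spaces

def cleaning(text):
    """Lowercase a user's posts and strip URLs, stop words, punctuation, numbers and non-Latin characters."""
    text = text.lower()
    
    # Pad the post separator and replace URLs with a placeholder token.
    text = re.sub(r'(\|\|\|)',r' ||| ',text)
    text = re.sub(r'http\S+', r' URL ',text)
    
    # Tokenize with spaCy and drop English stop words.
    doc = nlp(text)
    text = " ".join([word.text for word in doc if word.text not in stopwords])
    
    # Replace punctuation with spaces, keeping apostrophes and "|" for the patterns above.
    for token in punctuation.replace("'",'').replace("|",''):
        text = text.replace(token,' ')
        
    for (raw,rep) in givenpatterns:
        text = re.sub(raw,rep,text)
        
    # This is an attempt to remove non-Latin characters.
    text =re.sub(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', u'', text)
    
    return text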

In [None]:
text = data['posts'][0]
text

In [None]:
cleaning(text)

In [None]:
if skip_first_cleaning:
    clean_posts = pd.read_csv("/kaggle/input/only-df-clean-posts/first_phase_clean_posts.csv",index_col=0)
    data=data.join(clean_posts)
else:
    # This is a time consuming process...
    data['clean_posts']=data['posts'].apply(cleaning)

In [None]:
data.head(5)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
clean_cv = CountVectorizer().fit(data['clean_posts'])
clean_voc = clean_cv.vocabulary_
print("After cleaning, dataset contains {} words. This is {:.2f}% of the initial dataset's total words.".format(len(clean_voc),len(clean_voc)/len(voc)*100))

In [None]:
X=TfidfVectorizer( stop_words='english').fit_transform(data['clean_posts'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42,shuffle=True)

if skip_clean_posts_classification :

    tpot = Pipeline([('selectfwe', SelectFwe(alpha=0.006)), ('multinomialnb', MultinomialNB(alpha=0.01, fit_prior=False))])
    tpot.fit(X_train ,y_train)
    steps = tpot.steps
else:
    tpot = TPOTClassifier(generations=generations,verbosity=verbosity,
                              population_size=population_size,config_dict=config_dict,memory=memory,
                              n_jobs=-1,random_state=42)

    tpot.fit(X_train ,y_train)
    steps = tpot.fitted_pipeline_.steps
    
clean_dataset_score = tpot.score(X_test,y_test)*100
print("With clean data we have {:.2f} % accuracy. And the best pipeline is: {}".format(
                    clean_dataset_score ,steps)  )

In [None]:
print("Score without cleaning: \t{:.2f} %".format(pure_dataset_score))
print("Score after first cleaning: \t{:.2f} %".format(clean_dataset_score))

<h3>Now that we have cleaned up our dataset, let's take a look at the vocabulary.<br>
    Also, we can check each personality type's top words.</h3>

In [None]:
clean_terms = sorted(clean_cv.vocabulary_, key=clean_cv.vocabulary_.get)  # terms in column order
DTM = pd.DataFrame(clean_cv.transform(data['clean_posts']).toarray(),index=data.index,columns=clean_terms).join(data['type'],rsuffix="_pers")
DTM

In [None]:
# DTM=DTM.join(data['type'],rsuffix="_pers")
fr= DTM.groupby('type_pers').sum().transpose()
fr

In [None]:
import wordcloud as wc

In [None]:
# One panel per personality type, laid out on a 4x4 grid (16 types).
fig, axes = plt.subplots(4, 4, figsize=(20,15))
fig.patch.set_facecolor('xkcd:tan')

wordcloud = wc.WordCloud(stopwords=None,background_color='white',relative_scaling=1,max_font_size=100 ,normalize_plurals=False)

for ax,pers in zip(axes.flat,data['type'].unique()):
    scores = fr[pers].sort_values(ascending=False)[:10]  # ten most frequent words of this type
    wordcloud.fit_words(scores.to_dict())
    ax.imshow(wordcloud,interpolation='bilinear'); ax.set_title(pers); ax.axis('off')
plt.show()

In [None]:
for pers in DTM['type_pers'].unique():
    scores = fr[pers].sort_values(ascending=False)[:10]
    print("###",pers,": ",scores.index.values,"\n")

As we can see, people like discussing personality types, which is expected because the dataset comes from a personality forum. Moreover, people mention their own personality type more frequently than the others. This makes sense, but it also means the label leaks into the text, so a model trained on this data would largely learn to spot the type names. We have to remove all references to personality types.
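
Removing the type names can be sketched with a single regular expression. The example below is a simplified illustration on a made-up sentence: the list <code>mbti_types</code> is built from the four letter pairs instead of being read from the dataset, and the helper name <code>strip_type_mentions</code> is hypothetical.

```python
import re

# Lowercased MBTI codes, built from the four letter pairs (16 combinations).
mbti_types = [a + b + c + d for a in "ie" for b in "ns" for c in "tf" for d in "jp"]

# One alternation that also swallows a trailing plural "s" (e.g. "intjs").
type_pattern = re.compile("(?:" + "|".join(mbti_types) + ")s*")

def strip_type_mentions(text):
    """Remove every MBTI code from an already lowercased text."""
    return re.sub(r" {2,}", " ", type_pattern.sub("", text)).strip()

print(strip_type_mentions("as an infj i get along with entps and intjs"))
```

Because the pattern has no word boundaries, it also removes a type code that appears inside a longer word, which is a design choice rather than a requirement.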

<h2 name= "Text Cleaning">Text cleaning (round 2)</h2>

In [None]:
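personalities = [personality.lower() for personality in data['type'].unique()]

# One pattern for all types; "s*" also removes a trailing plural "s" (e.g. "infjs").
personality_pattern = re.compile('(?:' + '|'.join(personalities) + ')s*')

def second_cleaning(text):
    return personality_pattern.sub("", text)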

In [None]:
data['second_clean_posts'] = data['clean_posts'].apply(second_cleaning)

In [None]:
second_clean_cv = CountVectorizer().fit(data['second_clean_posts'])
second_clean_voc = second_clean_cv.vocabulary_
print("After second cleaning, dataset contains {} words. This is {:.2f}% of the initial dataset's total words.".format(len(second_clean_voc),len(second_clean_voc)/len(voc)*100))

In [None]:
X=TfidfVectorizer(stop_words='english').fit_transform(data['second_clean_posts'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42,shuffle=True)
if skip_second_clean_posts_classification:
    
    tpot = Pipeline([('selectfwe', SelectFwe(alpha=0.006)), ('multinomialnb', MultinomialNB(alpha=0.01, fit_prior=False))])
    tpot.fit(X_train ,y_train)
    steps = tpot.steps
else:
    
    tpot = TPOTClassifier(generations=generations,verbosity=verbosity, max_time_mins=500,
                              population_size=population_size,config_dict=config_dict,memory=memory,
                              n_jobs=-1,random_state=42)
    
    tpot.fit(X_train ,y_train)
    steps = tpot.fitted_pipeline_.steps
                          

second_clean_dataset_score = tpot.score(X_test,y_test)*100
print("With double clean data we have {:.2f} % accuracy. And the best pipeline is: {}".format(second_clean_dataset_score,steps))

<h3>We can keep cleaning the dataset, but we stop here.</h3>

In [None]:
print("Score without cleaning: \t{:.2f} %".format(pure_dataset_score))
print("Score after first cleaning: \t{:.2f} %".format(clean_dataset_score))
print("Score after second cleaning: \t{:.2f} %".format(second_clean_dataset_score))

In [None]:
if save_data_to_files:
    dataset['clean_posts'] = dataset['posts'].apply(cleaning)
    dataset['second_clean_posts'] = dataset['clean_posts'].apply(second_cleaning)
    dataset[['type','second_clean_posts']].to_csv("second_clean_posts.csv")
else:
    print("Not Saved!")

In [None]:
data

<h2>Lemmatisation</h2>

Lemmatisation is the process of reducing the different inflected forms of a word to its base form (lemma). To do this, I use the <a href="https://spacy.io/" >spaCy</a> library. <br>
**spaCy** is a Python library for natural language processing (NLP). It provides a variety of models trained to extract information from text in many different languages.<br>
Specifically, I use spaCy's English model <code>en_core_web_sm</code>.
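
As a rough illustration of the idea only, the toy sketch below maps a few inflected forms to their lemmas with a hand-written lookup table. The table <code>LEMMAS</code> and the function <code>toy_lemmatize</code> are hypothetical; spaCy's own lemmatizer is far more capable and is not implemented this way.

```python
# Toy lemma table (hypothetical and tiny), for illustration only.
LEMMAS = {"was": "be", "were": "be", "thinking": "think", "thought": "think", "friends": "friend"}

def toy_lemmatize(text):
    """Replace each whitespace-separated token with its lemma when the table knows it."""
    return " ".join(LEMMAS.get(token, token) for token in text.split())

print(toy_lemmatize("i was thinking my friends were right"))
```

The benefit for classification is that several surface forms collapse into one feature, which shrinks the vocabulary.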

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def spacy_lemma(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

In [None]:
if skip_lemmatazation :
    data = data.join(pd.read_csv("/kaggle/input/only-df-clean-posts/lemmatized_posts.csv",index_col=0) )
else:
    data['spacy_lemma'] = data['clean_posts'].map(spacy_lemma)

In [None]:
data['spacy_lemma'].head()

In [None]:
if save_data_to_files:
    data['spacy_lemma'].to_csv("lemmatized_posts.csv")

In [None]:
X=TfidfVectorizer( stop_words='english').fit_transform(data['spacy_lemma'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42,shuffle=True)
if skip_lemma_posts_classification:
    
    tpot = Pipeline([('selectfwe', SelectFwe(alpha=0.006)), ('multinomialnb', MultinomialNB(alpha=0.01, fit_prior=False))])
    tpot.fit(X_train ,y_train)
    steps = tpot.steps
else:
    
    tpot = TPOTClassifier(generations=generations,verbosity=verbosity, max_time_mins=500,
                              population_size=population_size,config_dict=config_dict,memory=memory,
                              n_jobs=-1,random_state=42)
    
    tpot.fit(X_train ,y_train)
    steps = tpot.fitted_pipeline_.steps
                          

lemma_post_dataset_score = tpot.score(X_test,y_test)*100
print("With lemmatised data we have {:.2f} % accuracy. And the best pipeline is: {}".format(lemma_post_dataset_score,steps))

In [None]:
print("Score without cleaning: \t{:.2f} %".format(pure_dataset_score))
print("Score after first cleaning: \t{:.2f} %".format(clean_dataset_score))
print("Score after second cleaning: \t{:.2f} %".format(second_clean_dataset_score))
print("Score after lemmatization: \t{:.2f} %".format(lemma_post_dataset_score))

Keep in mind that, when <code>data_reduction</code> is enabled, the models above are trained and tested on a reduced dataset. The class imbalance is still an open issue.