## Importing Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

## Loading Data

In [2]:
# Loading CSV file
train = pd.read_csv('train_E6oV3lV.csv')
test = pd.read_csv('test_tweets_anuFYb8.csv')

In [3]:
# Combining Both Datasets
combine = pd.concat([train, test], ignore_index=True, sort=True)

In [4]:
# Top 5 records
combine.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0.0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0.0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0.0 | bihday your majesty |
| 3 | 4 | 0.0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0.0 | factsguide: society now #motivation |


In [5]:
# Last 5 records
combine.tail()

| | id | label | tweet |
|---|---|---|---|
| 49154 | 49155 | NaN | thought factory: left-right polarisation! #tru... |
| 49155 | 49156 | NaN | feeling like a mermaid ð #hairflip #neverre... |
| 49156 | 49157 | NaN | #hillary #campaigned today in #ohio((omg)) &am... |
| 49157 | 49158 | NaN | happy, at work conference: right mindset leads... |
| 49158 | 49159 | NaN | my song "so glad" free download! #shoegaze ... |


In [6]:
# shape of dataframe
combine.shape

(49159, 3)

In [7]:
# View data information
combine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49159 entries, 0 to 49158
Data columns (total 3 columns):
id       49159 non-null int64
label    31962 non-null float64
tweet    49159 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ MB


In [8]:
# Label value counts in the training dataset
train['label'].value_counts()

0    29720
1     2242
Name: label, dtype: int64

## Tokenizing the Data With spaCy

Now that we know what we’re working with, let’s create a custom tokenizer function using `spaCy`. We’ll use this function to automatically strip information we don’t need, like stopwords and punctuation, from each tweet.

We’ll start by importing the English language class and stop word list we need from `spaCy`, as well as Python’s `string` module, which contains a helpful list of all punctuation marks in `string.punctuation`. We’ll create variables that contain the punctuation marks and stopwords we want to remove, and a parser that runs input through `spaCy`’s English module.

Then, we’ll create a `spacy_tokenizer()` function that accepts a sentence as input and processes it into tokens, performing lemmatization, lowercasing, and stop word removal. This is similar to what we did in the examples earlier in this tutorial, but now we’re putting it all together into a single function for preprocessing each tweet we’re analyzing.

In [9]:
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Create our set of stopwords
stop_words = STOP_WORDS

# Create a blank English pipeline, which provides spaCy's English tokenizer
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Tokenizing the sentence; the parser returns a spaCy Doc of Token objects
    mytokens = parser(sentence)

    # Lemmatizing and lowercasing each token; "-PRON-" is the placeholder lemma spaCy 2.x gives pronouns, so we keep their lowercased text instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens
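
To see what the filtering step at the end of `spacy_tokenizer()` does, here is a minimal sketch that needs only the standard library. It assumes the tokens have already been produced and lemmatized, and `stop_words` here is a tiny hypothetical set standing in for spaCy’s much larger `STOP_WORDS`.

```python
import string

# Hypothetical miniature stop-word set; spaCy's STOP_WORDS is much larger
stop_words = {"a", "is", "the", "and", "when"}
punctuations = string.punctuation

def filter_tokens(tokens):
    """Keep tokens that are neither stop words nor punctuation marks."""
    return [tok for tok in tokens
            if tok not in stop_words and tok not in punctuations]

# Hand-written tokens, standing in for the output of the lemmatizing step
tokens = ["when", "a", "father", "is", "dysfunctional", "!", "", "#", "model"]
kept = filter_tokens(tokens)
print(kept)  # ['father', 'dysfunctional', 'model']
```

Because `punctuations` is a string, `tok not in punctuations` is a substring test. That is why the empty string is dropped too, and why a multi-character token such as `!!` would be kept.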

## Defining a Custom Transformer

To further clean our text data, we’ll also create a custom transformer that removes leading and trailing whitespace and converts text to lowercase. Here, we create a custom `predictors` class which inherits from the `TransformerMixin` class. This class defines the `transform`, `fit` and `get_params` methods. We’ll also create a `clean_text()` function that strips whitespace and converts text to lowercase.

In [10]:
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

## Vectorization

In [11]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

When we classify text, we work with text snippets paired with their labels. But we can’t feed raw strings to a machine learning model; we need to convert the text into a numerical representation, just as the labels already are (here, 0 for positive and 1 for negative). Classifying text into positive and negative labels is called sentiment analysis.

One tool we can use for this is called `Bag of Words` (BoW). BoW converts text into a matrix of word occurrences: each row is a document, each column is a word from the vocabulary, and each cell records how many times that word occurs in that document, ignoring word order. This matrix is often referred to as a BoW matrix or a document-term matrix.

We can generate a BoW matrix for our text data by using scikit-learn’s `CountVectorizer`. In the cell above, we tell `CountVectorizer` to use the custom `spacy_tokenizer` function we built as its tokenizer, and define the n-gram range we want.
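
As a rough illustration of what a BoW matrix contains, here is a minimal standard-library sketch. It is not how `CountVectorizer` is implemented; the two example documents are made up, and whitespace splitting stands in for our `spacy_tokenizer`.

```python
from collections import Counter

# Two made-up documents; str.split() is a stand-in for a real tokenizer
docs = ["the cat sat", "the cat saw the dog"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term, cells hold counts
matrix = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Note that word order is lost: each document is reduced to its term counts.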

`N-grams` are sequences of adjacent words in a given text, where `n` is the number of words included in each token. For example, in the sentence “Who will win the football world cup in 2022?”, the unigrams are the single words “who”, “will”, “win”, and so on. The bigrams are the sequences of two contiguous words, such as “who will”, “will win”, and so on. The `ngram_range` parameter sets the lower and upper bounds of `n`; with `ngram_range=(1,1)` we use unigrams only. The resulting vectorizer is assigned to `bow_vector`.
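
The unigram and bigram example can be reproduced with a few lines of plain Python. The `ngrams` helper below is a hypothetical name used only for illustration, and the sentence is lowercased and split on whitespace rather than run through spaCy.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "who will win the football world cup in 2022".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams[:3])  # ['who', 'will', 'win']
print(bigrams[:2])   # ['who will', 'will win']
```

With scikit-learn, `ngram_range=(1, 2)` would include both of these sets as features, while `ngram_range=(1, 1)` keeps only the unigrams.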

In [12]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
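
`TfidfVectorizer` builds the same kind of matrix as `CountVectorizer`, but reweights the counts so that words appearing in many documents contribute less. The sketch below computes these weights by hand, assuming scikit-learn’s documented default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The two tiny documents are made up for illustration.

```python
import math
from collections import Counter

# Two made-up, already-tokenized documents
docs = [["good", "day"], ["bad", "day"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """TF-IDF weights for one document, L2-normalized."""
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(docs[0])
print(weights)  # "good" gets a larger weight than "day"
```

Here “day” occurs in both documents, so it is less informative than “good” and receives a smaller weight.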

## Splitting The Data into Training and Test Sets

In [13]:
from sklearn.model_selection import train_test_split

X = train['tweet'] # the features we want to analyze
y = train['label'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Creating a Pipeline and Generating the Model

Now that we’re all set up, it’s time to actually build our model! We’ll start by importing the classifier classes from scikit-learn and creating a `LinearSVC` classifier object (a linear support vector machine); `LogisticRegression` and `MultinomialNB` are imported as alternatives.

Then, we’ll create a `pipeline` with three components: a cleaner, a vectorizer, and a classifier. The `cleaner` uses our `predictors` class object to clean and preprocess the text. The `vectorizer` uses the `tfidf_vector` object to turn each tweet into a vector of TF-IDF weights. The `classifier` is the `LinearSVC` object that classifies the sentiments.

Once this pipeline is built, we’ll fit the pipeline components using fit().

In [14]:
# Classifiers: LinearSVC is used below; LogisticRegression and MultinomialNB are alternatives
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

classifier = LinearSVC()

# Create pipeline using TF-IDF features
pipe = Pipeline([('cleaner', predictors()),
                 ('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x000002B38FC57C48>), ('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
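
To make the fit and predict flow concrete, here is a toy sketch of how a pipeline chains its steps: every step except the last is fitted and used to transform the data, and the last step is fitted on the result. `TinyPipeline`, `Lowercaser` and `KeywordClassifier` are hypothetical stand-ins written for this illustration; they are not scikit-learn classes and the real `Pipeline` does considerably more.

```python
class Lowercaser:
    """Stand-in for the `predictors` cleaner: strips and lowercases text."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]


class KeywordClassifier:
    """Toy classifier: flags words seen only in label-1 training texts."""
    def fit(self, X, y):
        pos, neg = set(), set()
        for text, label in zip(X, y):
            (pos if label == 1 else neg).update(text.split())
        self.flag_words_ = pos - neg
        return self

    def predict(self, X):
        return [int(any(w in self.flag_words_ for w in text.split()))
                for text in X]


class TinyPipeline:
    """Chains (name, step) pairs; the last step is the classifier."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        *transformers, (_, final) = self.steps
        for _, step in transformers:
            X = step.fit(X, y).transform(X)  # fit, then pass data onward
        final.fit(X, y)
        return self

    def predict(self, X):
        *transformers, (_, final) = self.steps
        for _, step in transformers:
            X = step.transform(X)  # reuse the already-fitted transformers
        return final.predict(X)


toy_pipe = TinyPipeline([("cleaner", Lowercaser()),
                         ("classifier", KeywordClassifier())])
toy_pipe.fit(["  Good DAY ", "bad day", " Nice "], [0, 1, 0])
toy_predictions = toy_pipe.predict(["  BAD weather", "good weather"])
print(toy_predictions)  # [1, 0]
```

The benefit of this design is that the same cleaning and vectorizing steps are applied automatically at prediction time, so raw tweets can be passed straight to `predict()`.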

## Evaluating the Model

In [15]:
from sklearn.metrics import f1_score,classification_report,confusion_matrix
# Predicting on the held-out split (X_test)
predicted = pipe.predict(X_test)

In [16]:
print(classification_report(y_test,predicted))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      8920
           1       0.87      0.62      0.73       669

   micro avg       0.97      0.97      0.97      9589
   macro avg       0.92      0.81      0.85      9589
weighted avg       0.97      0.97      0.96      9589



In [17]:
f1_score(y_test,predicted)

0.7255244755244755
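
The `f1_score` above is the harmonic mean of precision and recall for class 1. As a quick check, the sketch below recomputes it from the rounded values in the classification report, so the result is only approximate. It also computes the accuracy of a hypothetical baseline that always predicts 0, using the support counts from the same report.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 values copied from the classification report
approx_f1 = f1(0.87, 0.62)
print(round(approx_f1, 3))  # 0.724, close to the exact 0.7255 from f1_score

# Support counts from the report: 8920 tweets with label 0, 669 with label 1
baseline_accuracy = 8920 / (8920 + 669)
print(round(baseline_accuracy, 2))  # 0.93
```

Always predicting 0 would already be about 93% accurate on this split, which is why the F1 score on the minority class is the more informative number here.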

In [18]:
test_tweet = test['tweet']
test_tweet

0        #studiolife #aislife #requires #passion #dedic...
1         @user #white #supremacists want everyone to s...
2        safe ways to heal your #acne!!    #altwaystohe...
3        is the hp and the cursed child book up for res...
4          3rd #bihday to my amazing, hilarious #nephew...
5                              choose to be   :) #momtips 
6        something inside me dies ð¦ð¿â¨  eyes nes...
7        #finished#tattoo#inked#ink#loveitâ¤ï¸ #â¤ï¸...
8         @user @user @user i will never understand why...
9        #delicious   #food #lovelife #capetown mannaep...
10       1000dayswasted - narcosis infinite ep.. make m...
11       one of the world's greatest spoing events   #l...
12       half way through the website now and #allgoing...
13       good food, good life , #enjoy and   ððð...
14       i'll stand behind this #guncontrolplease   #se...
15       i ate,i ate and i ate...ðð   #jamaisasth...
16        @user got my @user limited edition rain or sh.

In [19]:
test_pred = pipe.predict(test_tweet)

In [20]:
test['label'] = test_pred

submission = test[['id','label']]

submission.to_csv('result_spacy.csv', index=False)