<H1>Sentiment Analysis</H1>
<br>
The Internet is reaching more and more people every day, and people increasingly interact through social networking sites. That brings plenty of positives, but we can't deny the existence of online bullying. Sentiment analysis can help us tackle it. Here's an attempt at doing that using scikit-learn, NLTK and pandas.

pandas helps us read CSV data. There will be a separate post on pandas.

pandas reads the training data and presents it to us as a data frame. You can think of a data frame as an SQL table that we can query.


In [29]:
import pandas as pd
file_path = "../../..//Downloads/train_1.csv"
df = pd.read_csv(file_path)
df[:1]

Unnamed: 0,id,comment_text,toxic
0,1231.0,you are bad.,1
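
To make the SQL-table analogy concrete, here is a minimal sketch of "querying" a data frame. The rows are made up for illustration (only the column names comment_text and toxic come from the training file), so treat it as a toy rather than real data.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the training file.
toy = pd.DataFrame({
    "comment_text": ["you are bad.", "have a nice day", "nobody likes you"],
    "toxic": [1, 0, 1],
})

# Roughly: SELECT comment_text FROM toy WHERE toxic = 1
toxic_comments = toy.loc[toy["toxic"] == 1, "comment_text"]

# Roughly: SELECT toxic, COUNT(*) FROM toy GROUP BY toxic
counts = toy.groupby("toxic").size()

print(toxic_comments.tolist())
print(counts.to_dict())
```

Boolean indexing plays the role of WHERE, and groupby plays the role of GROUP BY.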


<br>
It's very important to split the given data into two parts: one for training our model and the other for testing it. scikit-learn lets us do this with <i>train_test_split</i>. Setting <i>random_state</i> to a fixed value makes the split reproducible, so the same rows end up in train and test no matter how many times you run it.
<br>

In [11]:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.comment_text, df.toxic, test_size=0.20, random_state=42)
name = df.toxic.name
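
As a quick check of what a fixed random_state buys us, the sketch below splits the same toy data twice and compares the results. The texts and labels are hypothetical stand-ins for df.comment_text and df.toxic.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df.comment_text and df.toxic.
texts = ["comment %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

# Two independent calls with the same random_state...
split_a = train_test_split(texts, labels, test_size=0.20, random_state=42)
split_b = train_test_split(texts, labels, test_size=0.20, random_state=42)

# ...produce identical train/test partitions.
same_split = all(a == b for a, b in zip(split_a, split_b))
X_tr, X_te, y_tr, y_te = split_a
print(same_split, len(X_tr), len(X_te))
```

With test_size=0.20, two of the ten rows land in the test set and eight in the training set.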

<br>
Let's be practical. Real-world text is rarely ideal, with correct spelling throughout. Some of it will also contain internet-style typos such as itttttttt for it. TextBlob helps here. <i>TextBlob(message).words</i> gives us the list of words in a sentence. We can call word.correct() to correct the spelling of a word; it has about 70% accuracy. Lemmatization converts a word into its root form: given <i>playing</i>, it returns <i>play</i>. word.lemmatize() does this; it accepts an optional part-of-speech argument, so verbs may need word.lemmatize('v').

Preprocessing matters because any model ultimately relies on similar words occurring in both the training and the test data.
<br>

In [14]:
from textblob import TextBlob
import nltk
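def split_into_lemmas(message):
    # Decode byte strings (Python 2 str, Python 3 bytes) to text before lowercasing.
    if isinstance(message, bytes):
        message = message.decode('utf8')
    message = message.lower()
    words = TextBlob(message).words
    # word.lemma is TextBlob's lemmatized form of each word.
    return [word.lemma for word in words]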
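
Elongated typos such as itttttttt can also be squeezed before spelling correction. The helper below is a hypothetical addition, not part of the original preprocessing. It is a minimal sketch that assumes any run of three or more identical characters is an elongation.

```python
import re

def squeeze_repeats(text):
    """Collapse any run of 3+ identical characters down to a single character."""
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(squeeze_repeats("itttttttt is sooooo bad"))
```

Genuine double letters such as the "oo" in "good" are left alone, but "coool" would become "col", so a spelling-correction step is still useful afterwards.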

<br>
We are going to train four different learning models on the given data and compare the results of the
different classifiers. Describing each classifier is beyond the scope of this post; I will provide the relevant links
at the end of this post.

A pipeline is a sequence of steps that are applied to the text data one after another.

Here our pipeline has three steps:
    a. CountVectorizer()
    b. TfidfTransformer()
    c. A classifier: SGDClassifier, LogisticRegression, MultinomialNB or SVC, one in each of our four pipelines
    
CountVectorizer : Converts a collection of text documents to a matrix of token counts. Each document is first
        split into tokens by the analyzer, which is split_into_lemmas in our case. So "You are awesome" becomes
        the tokens "you", "are", "awesome" (split_into_lemmas lowercases the text). With the built-in word
        analyzer, ngram_range=(2,4) would build n-grams of 2, 3 and 4 consecutive words, and stop_words would
        remove commonly occurring English words. Note that scikit-learn applies ngram_range and stop_words only
        with its built-in analyzers; because we pass a callable analyzer, they have no effect here.
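
As a small illustration of how ngram_range behaves, the sketch below compares scikit-learn's built-in word analyzer with a callable analyzer. Here str.split is only a stand-in for split_into_lemmas (an assumption made so the example runs without TextBlob), and the sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "You are awesome today"

# Built-in word analyzer: ngram_range=(2, 4) yields runs of 2 to 4 words.
word_analyzer = CountVectorizer(analyzer="word", ngram_range=(2, 4)).build_analyzer()
word_ngrams = list(word_analyzer(sentence))

# Callable analyzer (str.split stands in for split_into_lemmas here):
# its output is used as-is, so ngram_range has no effect.
custom_analyzer = CountVectorizer(analyzer=str.split, ngram_range=(2, 4)).build_analyzer()
custom_tokens = list(custom_analyzer(sentence))

print(word_ngrams)
print(custom_tokens)
```

If word n-grams are wanted together with lemmatization, the callable itself has to produce them.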

TfidfTransformer : Transform a count matrix to a normalized tf or tf-idf representation. Tf means term-frequency 
    while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme 
    in information retrieval, that has also found good use in document classification. The goal of using 
    tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down 
    the impact of tokens that occur very frequently in a given corpus and that are hence empirically less 
    informative than features that occur in a small fraction of the training corpus.
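
To see the weighting at work, here is a pure-Python sketch of tf-idf on three made-up documents. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which is what scikit-learn documents as the TfidfTransformer defaults. It is an illustration, not the library's implementation.

```python
import math

# Hypothetical token counts for three tiny documents.
counts = [
    {"you": 1, "are": 1, "bad": 1},
    {"you": 1, "are": 1, "awesome": 1},
    {"you": 2, "rock": 1},
]
n_docs = len(counts)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for doc in counts if term in doc)
    return math.log((1.0 + n_docs) / (1.0 + df)) + 1

def tfidf(doc):
    # Term frequency times idf, then scale the document vector to unit length.
    weights = {term: tf * idf(term) for term, tf in doc.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
    return {term: w / norm for term, w in weights.items()}

first = tfidf(counts[0])
print(first)
```

In the first document, "you" appears in every document and gets the lowest weight, while "bad" appears only once in the corpus and gets the highest.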
    
Classifier : The classifier learns from the training data and assigns a label to each document.
<br>
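
Before building the real pipelines, here is a minimal sketch of the three steps on a made-up corpus. The texts, the labels and the name toy_clf are hypothetical, and the default word analyzer is used instead of split_into_lemmas so the example runs without TextBlob.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = toxic, 0 = not toxic.
train_texts = ["you are bad", "you are awful", "you are awesome", "have a nice day"]
train_labels = [1, 1, 0, 0]

toy_clf = Pipeline([
    ("vect", CountVectorizer()),    # text -> token counts
    ("tfidf", TfidfTransformer()),  # counts -> tf-idf weights
    ("clf", MultinomialNB()),       # weights -> label
])
toy_clf.fit(train_texts, train_labels)

# Each step is reachable by name, and predict() runs all three in order.
step_names = [name for name, _ in toy_clf.steps]
prediction = toy_clf.predict(["you are bad"])
print(step_names, prediction)
```

Calling fit or predict on the pipeline runs every step in order, so the raw text never has to be vectorised by hand.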

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

<br>
<h2>Making pipeline of SGDClassifier</h2>
<br>


In [15]:
from sklearn.pipeline import Pipeline
text_clf_SGDClassifier = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier()),
])
text_clf_SGDClassifier.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,
        binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(2, 4), preprocess...   penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False))])

<br>
<h2>Making pipeline of LogisticRegression</h2>
<br>

In [16]:
text_clf_LogisticRegression = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
])
text_clf_LogisticRegression.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,
        binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(2, 4), preprocess...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

<br>
<h2>Making pipeline of MultinomialNB</h2>
<br>

In [17]:
text_clf_MultinomialNB = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf_MultinomialNB.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,
        binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(2, 4), preprocess...False,
         use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

<br>
<h2>Making pipeline of SVC</h2>
<br>

In [43]:
text_clf_SVC = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear')),
])
text_clf_SVC.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,
        binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(2, 4), preprocess...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

<br>
The models learn from the data through the fit method. Now our models are ready to predict on the test data.
Let's start predicting.
<br>

In [44]:
predicted_SVC = text_clf_SVC.predict(X_test)

In [33]:
predicted_MultinomialNB = text_clf_MultinomialNB.predict(X_test)

In [34]:
predicted_LogisticRegression = text_clf_LogisticRegression.predict(X_test)

In [35]:
predicted_SGDClassifier = text_clf_SGDClassifier.predict(X_test)
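
One way to compare classifiers once predictions are available is to score them against the true labels. The sketch below uses made-up labels and predictions (model_a and model_b are hypothetical stand-ins for the predicted_* arrays, and y_true for y_test). It does not report results from the actual data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions standing in for y_test and predicted_*.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],
    "model_b": [1, 1, 1, 0, 0, 1, 1, 0],
}

# Fraction of test comments each model labels correctly.
scores = {name: accuracy_score(y_true, pred) for name, pred in toy_predictions.items()}
for name, score in sorted(scores.items()):
    print(name, score)

# Rows are true labels, columns are predicted labels.
matrix_a = confusion_matrix(y_true, toy_predictions["model_a"])
print(matrix_a)
```

Accuracy alone can mislead when toxic comments are rare, which is why the confusion matrix is printed as well.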

In [2]:
!pip install mediapipe

Collecting mediapipe
  Downloading mediapipe-0.8.11-cp38-cp38-win_amd64.whl (49.0 MB)
Collecting opencv-contrib-python
  Downloading opencv_contrib_python-4.6.0.66-cp36-abi3-win_amd64.whl (42.5 MB)
Installing collected packages: opencv-contrib-python, mediapipe
Successfully installed mediapipe-0.8.11 opencv-contrib-python-4.6.0.66


In [6]:
import mediapipe as mp
import cv2
import time
import numpy as np
import pandas as pd
import os
mpPose = mp.solutions.pose
pose = mpPose.Pose()
mpDraw = mp.solutions.drawing_utils # For drawing keypoints
points = mpPose.PoseLandmark # Landmarks
path = "D:/downloads/DATASET/TRAIN/plank" # enter dataset path
data = []
for p in points:
    x = p.name  # landmark name such as NOSE; avoids slicing str(p), which differs across Python versions
    data.append(x + "_x")
    data.append(x + "_y")
    data.append(x + "_z")
    data.append(x + "_vis")
data = pd.DataFrame(columns = data) # Empty dataset

In [7]:
count = 0

for img_name in os.listdir(path):
    temp = []
    img = cv2.imread(path + "/" + img_name)
    if img is None:  # skip files OpenCV cannot read as images
        continue
    imageHeight, imageWidth = img.shape[:2]  # shape is (rows, cols) = (height, width)
    imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    blackie = np.zeros(img.shape) # Blank image
    results = pose.process(imgRGB)
    if results.pose_landmarks:
        # mpDraw.draw_landmarks(img, results.pose_landmarks, mpPose.POSE_CONNECTIONS) #draw landmarks on image
        mpDraw.draw_landmarks(blackie, results.pose_landmarks, mpPose.POSE_CONNECTIONS) # draw landmarks on blackie
        landmarks = results.pose_landmarks.landmark
        for i, j in zip(points, landmarks):
            temp = temp + [j.x, j.y, j.z, j.visibility]
        data.loc[count] = temp
        count += 1
    cv2.imshow("Image", img)
    cv2.imshow("blackie", blackie)
    cv2.waitKey(100)

data.to_csv("dataset3.csv") # save the data as a csv file

In [8]:
data

Unnamed: 0,NOSE_x,NOSE_y,NOSE_z,NOSE_vis,LEFT_EYE_INNER_x,LEFT_EYE_INNER_y,LEFT_EYE_INNER_z,LEFT_EYE_INNER_vis,LEFT_EYE_x,LEFT_EYE_y,...,RIGHT_HEEL_z,RIGHT_HEEL_vis,LEFT_FOOT_INDEX_x,LEFT_FOOT_INDEX_y,LEFT_FOOT_INDEX_z,LEFT_FOOT_INDEX_vis,RIGHT_FOOT_INDEX_x,RIGHT_FOOT_INDEX_y,RIGHT_FOOT_INDEX_z,RIGHT_FOOT_INDEX_vis
0,0.289922,0.593100,-0.025951,0.999539,0.283265,0.609782,-0.015936,0.999151,0.283659,0.611358,...,-0.057191,0.980864,0.684848,0.977012,0.166631,0.818348,0.688589,0.992115,-0.063247,0.963998
1,0.245275,0.522872,-0.683064,0.999585,0.235694,0.508013,-0.677097,0.999235,0.238029,0.503244,...,0.167703,0.981539,0.355404,0.111376,0.048933,0.833656,0.925088,0.799281,0.027391,0.967267
2,0.176850,0.553812,-0.229864,0.999543,0.157233,0.540791,-0.217360,0.999085,0.156546,0.536593,...,0.169403,0.979295,0.637016,0.238359,0.220757,0.844211,0.914533,0.735807,0.100385,0.968350
3,0.249578,0.581288,-0.057560,0.989069,0.227002,0.589646,-0.045526,0.988572,0.225584,0.591502,...,0.169233,0.977160,1.094709,0.476605,0.554517,0.799060,1.225944,0.465166,0.072315,0.957452
4,0.586579,0.501801,0.164065,0.929631,0.564449,0.514188,0.078241,0.925645,0.562566,0.514677,...,1.681661,0.889646,0.758696,0.446640,1.672047,0.728083,0.615799,0.417369,1.564325,0.871773
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
221,0.260820,0.436501,-0.109513,0.979909,0.249065,0.448528,-0.099680,0.980340,0.248566,0.450631,...,0.096050,0.765992,0.793931,0.874893,0.000087,0.723369,0.791034,0.869050,0.000746,0.746369
222,0.554634,0.340087,-0.443477,0.980669,0.564973,0.343498,-0.488323,0.981241,0.562241,0.352715,...,0.443101,0.778746,0.578370,0.812583,0.484400,0.715286,0.688028,0.397476,0.447443,0.740647
223,0.238893,0.385803,-0.247364,0.982584,0.226386,0.368248,-0.223513,0.983045,0.226521,0.361622,...,0.145241,0.779742,0.861822,0.639790,-0.069882,0.738053,0.832216,0.761814,-0.027853,0.758346
224,-0.076196,0.346832,0.079952,0.984301,-0.103910,0.317433,0.036771,0.984718,-0.104165,0.314715,...,0.092924,0.760272,1.246047,0.807414,-0.108435,0.751234,1.224538,0.767197,0.040713,0.742222


In [9]:
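from sklearn.svm import SVC

# dataset3.csv was written with its index, so read that column back as the index.
data = pd.read_csv("dataset3.csv", index_col=0)
# NOTE: the extraction cell above never writes a 'target' column, which is why
# this cell raises KeyError: 'target'. Judging by the branch below, the label
# is expected to be 0 for plank and non-zero for goddess.
X, Y = data.iloc[:, :132], data['target']  # 33 landmarks x 4 values = 132 features
model = SVC(kernel='poly')
model.fit(X, Y)

mpPose = mp.solutions.pose
pose = mpPose.Pose()
mpDraw = mp.solutions.drawing_utils
path = "enter image path"
img = cv2.imread(path)
imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB; keep img in BGR for display
results = pose.process(imgRGB)
if results.pose_landmarks:
    landmarks = results.pose_landmarks.landmark
    temp = []  # start from an empty feature vector for this image
    for j in landmarks:
        temp = temp + [j.x, j.y, j.z, j.visibility]
    y = model.predict([temp])
    if y[0] == 0:
        asan = "plank"
    else:
        asan = "goddess"
    print(asan)
    cv2.putText(img, asan, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 0), 3)
    cv2.imshow("image", img)
    cv2.waitKey(0)  # keep the window open until a key is pressed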

KeyError: 'target'
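
The KeyError is raised because the saved CSV contains only landmark columns and no label. A minimal sketch of one way to add the label is shown below, assuming landmarks are extracted separately for each pose. The two small frames and their values are hypothetical stand-ins for the full 132-column data, and the convention 0 = plank, 1 = goddess is inferred from the prediction branch.

```python
import pandas as pd

# Hypothetical per-pose frames; in practice each would hold the 132 landmark columns.
plank = pd.DataFrame({"NOSE_x": [0.29, 0.25], "NOSE_y": [0.59, 0.52]})
goddess = pd.DataFrame({"NOSE_x": [0.51, 0.48], "NOSE_y": [0.33, 0.35]})

# Label convention inferred from the prediction branch: 0 = plank, 1 = goddess.
plank["target"] = 0
goddess["target"] = 1

# Stack both poses into one labelled dataset, then separate features and labels.
labelled = pd.concat([plank, goddess], ignore_index=True)
X, Y = labelled.drop(columns=["target"]), labelled["target"]
print(labelled.shape, Y.tolist())
```

Selecting features with drop(columns=["target"]) avoids depending on the exact column count.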