# GOAL
In this competition, we are asked to build a model that recognizes toxicity in comments and minimizes unintended bias.

### `Table of contents`

1. [Loading packages](#load)
2. [Understanding unintended bias](#bias)
3. [Loading data](#data)
4. [Data Cleaning](#clean)
5. [Training data](#traindata)
6. [Baseline Model](#model)
7. [Training Stage](#modeltrain)
8. [Prediction](#Prediction)
9. [Submission](#Submission)

## Loading packages <a class="anchor" id="load"></a>

In [None]:
from IPython.display import HTML
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.linear_model import LogisticRegression
from scipy import sparse
import re
import string

## Understanding unintended bias <a class="anchor" id="bias"></a>

In [None]:
HTML('<iframe width="800" height="400" src="https://www.youtube.com/embed/59bMh59JQDo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Loading data <a class="anchor" id="data"></a>

Examples

Below are some examples of comments with their associated toxicity and identity labels. Label values range from 0.0 to 1.0 and represent the fraction of raters who believed the label fit the comment.

1. Comment: I'm a white woman in my late 60's and believe me, they are not too crazy about me either!!

    Toxicity Labels: `All 0.0`
    
    Identity Mention Labels: `female: 1.0, white: 1.0 (all others 0.0)`

2. Comment: Why would you assume that the nurses in this story were women?

    Toxicity Labels: `All 0.0`
    
    Identity Mention Labels: `female: 0.8 (all others 0.0)`

3. Comment: Continue to stand strong LGBT community. Yes, indeed, you'll overcome and you have.

    Toxicity Labels: `All 0.0`
    
    Identity Mention Labels: `homosexual_gay_or_lesbian: 0.8, bisexual: 0.6, transgender: 0.3 (all others 0.0)`

In [None]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')

## Data Cleaning <a class="anchor" id="clean"></a>

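`TfidfVectorizer` transforms raw text into TF-IDF feature vectors that can be used as input to an estimator. Here it is configured with a custom tokenizer that splits punctuation into separate tokens, and it builds both unigram and bigram features.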
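To make the transformation concrete, here is a minimal sketch on a hypothetical three-sentence corpus (not competition data), using the default `TfidfVectorizer` settings rather than the tuned ones in this notebook. Each row of the result is one document and each column one vocabulary term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "you are great",
    "you are awful",
    "great great movie",
]

vec = TfidfVectorizer()        # default settings (word unigrams, L2-normalised rows)
X = vec.fit_transform(docs)    # sparse matrix of shape (n_documents, n_terms)

# Terms listed in column order, recovered from the learned vocabulary mapping.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
row_norms = np.sqrt(X.multiply(X).sum(axis=1))

print(vocab)
print(X.shape)
print(X.toarray().round(2))
```

With the default `norm='l2'`, every non-empty row has unit length, so comments of different lengths end up on a comparable scale.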

In [None]:
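# Put spaces around every punctuation mark so that each one becomes its own token
text = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return text.sub(r' \1 ', s).split()
length = train_df.shape[0]
# Unigrams and bigrams; drop terms seen in fewer than 3 documents or in more than 90% of them
Vectorize = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True)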
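As a quick illustration of what the tokenizer does, the sketch below applies the same substitute-then-split idea to a made-up sentence. The name `demo_tokenize` is hypothetical, only ASCII punctuation is handled, and `re.escape` is an added precaution rather than part of the original cell.

```python
import re
import string

# Same approach as the notebook's tokenizer, limited to ASCII punctuation.
# re.escape is an assumption added here so no character is read as regex syntax.
punct = re.compile(f"([{re.escape(string.punctuation)}])")

def demo_tokenize(s):
    # Surround each punctuation mark with spaces, then split on whitespace.
    return punct.sub(r" \1 ", s).split()

tokens = demo_tokenize("Don't shout, please!")
print(tokens)  # ['Don', "'", 't', 'shout', ',', 'please', '!']
```

Punctuation is kept as separate tokens rather than discarded, so marks such as `!` can still contribute features.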

## Training data <a class="anchor" id="traindata"></a>

In [None]:
train = Vectorize.fit_transform(train_df["comment_text"])
test = Vectorize.transform(test_df["comment_text"])

In [None]:
#Target
y = np.where(train_df['target'] >= 0.5, 1, 0)
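The thresholding above turns the fractional `target` into a binary label. A minimal sketch on hypothetical rater fractions shows where the cut-off falls:

```python
import numpy as np

# Hypothetical fractions of raters who marked a comment as toxic.
fractions = np.array([0.0, 0.49, 0.5, 0.8])

# A comment counts as toxic (1) when at least half of the raters said so.
labels = np.where(fractions >= 0.5, 1, 0)
print(labels)  # [0 0 1 1]
```

A value of exactly 0.5 is labelled toxic because the comparison is `>=`.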

## Baseline Model <a class="anchor" id="model"></a>

In [None]:
class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

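    def fit(self, x, y):
        # Validate the inputs; sparse TF-IDF matrices are accepted
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            # Smoothed average feature value over the rows with label y_i
            p = x[y==y_i].sum(0)
            return (p+1) / ((y==y_i).sum()+1)

        # Naive Bayes log-count ratio, used to rescale every feature column
        self._r = sparse.csr_matrix(np.log(pr(x,1,y) / pr(x,0,y)))
        x_nb = x.multiply(self._r)
        # dual=True is only supported by the liblinear solver, so select it explicitly
        self._clf = LogisticRegression(C=self.C, dual=self.dual, solver='liblinear',
                                       n_jobs=self.n_jobs).fit(x_nb, y)
        return self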
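The core of the classifier is the log-count ratio computed in `fit`. The sketch below reproduces that calculation on a small, hypothetical dense matrix so the numbers can be checked by hand. The helper name `smoothed_mean` is invented for this example, and dense NumPy arrays stand in for the sparse matrices used in the notebook.

```python
import numpy as np

# Hypothetical toy data: rows are comments, columns are features.
X = np.array([
    [1.0, 0.0, 1.0],   # toxic
    [1.0, 0.0, 0.0],   # toxic
    [0.0, 1.0, 1.0],   # non-toxic
    [0.0, 1.0, 0.0],   # non-toxic
])
y = np.array([1, 1, 0, 0])

def smoothed_mean(X, y, label):
    # Add-one smoothed average of each feature over the rows of one class.
    mask = y == label
    return (X[mask].sum(axis=0) + 1) / (mask.sum() + 1)

# Positive entries mark features that lean toxic, negative ones non-toxic.
r = np.log(smoothed_mean(X, y, 1) / smoothed_mean(X, y, 0))

# Features are rescaled by r before being passed to logistic regression.
X_nb = X * r
print(r.round(3))
```

The first feature appears only in toxic rows and gets a positive ratio, the second appears only in non-toxic rows and gets a negative one, and the third is evenly split, so its ratio is zero.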

## Training Stage <a class="anchor" id="modeltrain"></a>

In [None]:
NbSvm = NbSvmClassifier(C=1.5, dual=True, n_jobs=-1)
NbSvm.fit(train, y)

## Prediction <a class="anchor" id="Prediction"></a>

In [None]:
prediction=NbSvm.predict_proba(test)[:,1]
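The `[:, 1]` index selects the probability of the positive class. The sketch below checks this on a hypothetical one-feature dataset with a plain `LogisticRegression`, assuming the labels are 0 and 1 as in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: larger feature values belong to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)   # one column per entry of clf.classes_

# Columns follow the order of clf.classes_, so class 1 is at index 1 here.
pos_col = list(clf.classes_).index(1)
positive = proba[:, pos_col]
print(clf.classes_, positive.round(3))
```

Each row of `proba` sums to 1, so the second column alone carries the toxicity score.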

## Submission <a class="anchor" id="Submission"></a>

In [None]:
submission = pd.read_csv("../input/sample_submission.csv")
submission['prediction'] = prediction
submission.to_csv('submission.csv', index=False)

Reference:
[NB-SVM strong linear baseline kernel](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline)