<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/3810/media/treebank.png"/>

# About Competition
The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles such as sentence negation, sarcasm, terseness, and language ambiguity make this task very challenging. The competition files are available [here](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews).

# Load Competition Dataset

The competition dataset is located in "/kaggle/input", the path Kaggle defines for accessing competition files. We will pick up two input files from this path: the training file and the test file.

In [14]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path=os.path.join(dirname, filename)
        if 'train' in path:
            __training_path=path
        elif 'test' in path:
            __test_path=path

## Input Dataset

In [3]:
#loaded files
print(f'Training path:{__training_path}\nTest path:{__test_path}')

In [1]:
# Kaggle Environment Preparation
# update the Kaggle environment
import sys
# you may need to update the environment so that the whole notebook can run
!{sys.executable} -m pip install --upgrade scikit-learn=="0.24.2"

In [2]:
#record this information if you need to run the Kernel internally
import sklearn; sklearn.show_versions() 

# Exploratory Data Analysis (EDA)
## General Structure
The training dataset includes <b>4</b> columns and <b>156060</b> rows.
There are <b>2</b> different data types as follows: *int64, object*.

# Finding Interesting Datapoints
Let's examine the value frequencies of each field and check whether there are any interesting data points.

There is 1 interesting value in the following column.
The table below shows each <b>Value</b> of each <b>Field</b> (column) with its total frequency. <b>Lower</b> is the lower bound and <b>Upper</b> is the upper bound of the frequency range expected under a normal distribution, and <b>Criteria</b> shows whether the frequency exceeded the <b>Upper bound</b> or fell below the <b>Lower bound</b>.
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Field</th>
      <th>Value</th>
      <th>Frequency</th>
      <th>Lower</th>
      <th>Upper</th>
      <th>Criteria</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Sentiment</td>
      <td>2</td>
      <td>79582</td>
      <td>7072.8536</td>
      <td>79563.338</td>
      <td>Upper</td>
    </tr>
  </tbody>
</table>


For example, in the <b>Sentiment</b> column the value <b>2</b> occurs <b>79582</b> times, and this count is not between the Lower bound (7072.8536) and the Upper bound (79563.338).


Let $C_0=2$, $Freq(C_0)=79582$, $Upper(C_0)=79563.338$, and $Lower(C_0)=7072.8536$. Then

$Freq(C_0) > Upper(C_0)$.
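
The exact rule that produced the bounds in the table is not shown in this notebook. As a rough sketch only, assuming the bounds are low and high quantiles of the per-value frequencies, a check of this kind could look like the following. The function name `flag_unusual_frequencies`, the quantile levels, and the toy data are hypothetical and are not the ones behind the table above.

```python
import numpy as np
import pandas as pd

def flag_unusual_frequencies(column, lower_q=0.05, upper_q=0.95):
    """Return the values whose frequency lies outside [Lower, Upper].

    Assumption: Lower and Upper are quantiles of the value frequencies.
    """
    freq = column.value_counts()          # frequency of each distinct value
    lower = freq.quantile(lower_q)        # assumed lower bound
    upper = freq.quantile(upper_q)        # assumed upper bound
    report = pd.DataFrame({
        "Field": column.name,
        "Value": freq.index,
        "Frequency": freq.to_numpy(),
        "Lower": lower,
        "Upper": upper,
    })
    # mark which bound, if any, each frequency crosses
    report["Criteria"] = np.where(
        report["Frequency"] > upper, "Upper",
        np.where(report["Frequency"] < lower, "Lower", ""))
    return report[report["Criteria"] != ""].reset_index(drop=True)

# toy column, not the competition data
toy = pd.Series([2] * 8 + [1] * 3 + [3] * 3 + [4] * 2 + [0], name="Sentiment")
flagged = flag_unusual_frequencies(toy)
print(flagged)
```

On the toy column, the most frequent value is flagged with `Upper` and the rarest with `Lower`, mirroring the structure of the table above.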

# Input Dataset

In [None]:
def __load__data(__training_path, __test_path, concat=False):
	"""load data as input dataset
	params: __training_path: the path of the training dataset
	params: __test_path: the path of the test dataset
	params: concat: currently unused; the training and test datasets are always returned separately
	returns: the loaded training dataset and test dataset
	"""
	# LOAD DATA
	import pandas as pd
	# choose the delimiter from each file's own extension: comma for csv, tab otherwise
	__train_dataset = pd.read_csv(__training_path, delimiter=',' if __training_path.endswith('csv') else '\t')
	__test_dataset = pd.read_csv(__test_path, delimiter=',' if __test_path.endswith('csv') else '\t')
	return __train_dataset, __test_dataset
__train_dataset, __test_dataset = __load__data(__training_path, __test_path, concat=True)
__train_dataset.head()

In [None]:
# STORE SUBMISSION RELEVANT COLUMNS
__test_dataset_submission_columns = __test_dataset['PhraseId']

### Discard Irrelevant Columns
In the given input dataset there are <b>2</b> columns that can be removed: *PhraseId, SentenceId*.

In [None]:
# DISCARD IRRELEVANT COLUMNS
__train_dataset.drop(['PhraseId', 'SentenceId'], axis=1, inplace=True)
__test_dataset.drop(['PhraseId', 'SentenceId'], axis=1, inplace=True)

# Text Processing
The dataset has <b>1</b> text value as follows: <b>Phrase</b>.
Now, let's convert the text as follows.

- First, convert text to lowercase;

- Second, strip all punctuation;

- Finally, convert all numbers in text to 'num', so that in the next step our model sees a single token instead of a variety of number tokens.

In [None]:
# PREPROCESSING-1
import re
import string
_TEXT_COLUMNS = ['Phrase']
def process_text(__dataset):
    # translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    for _col in _TEXT_COLUMNS:
        # convert text to lowercase
        processed = [t.lower() for t in __dataset[_col]]
        # strip all punctuation
        processed = [t.translate(table) for t in processed]
        # convert all numbers in text to 'num'
        processed = [re.sub(r'\d+', 'num', t) for t in processed]
        __dataset[_col] = processed
    return __dataset
__train_dataset = process_text(__train_dataset)
__test_dataset = process_text(__test_dataset)
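
To make the effect of these three steps concrete, here is a minimal sketch that applies them to a single string. The helper name `normalize_phrase` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re
import string

# translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_phrase(text):
    """Lowercase, strip punctuation, then replace digit runs with 'num'."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r'\d+', 'num', text)

sample = "A 90-minute film; it's GREAT!"
print(normalize_phrase(sample))  # a numminute film its great
```

Note that stripping punctuation before replacing numbers joins hyphenated pieces, so "90-minute" ends up as the single token "numminute".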

### Target Column
The target column is the value we need to predict.
Therefore, we need to detach the target column from the features before training.
Note that if we don't drop this field, the model will learn from the label itself: it will show high accuracy on the training data and poor accuracy on the test data (because the value in the test dataset is Null).
Here is the *target column*: <b>Sentiment</b>

In [None]:
# DETACH TARGET
__feature_train = __train_dataset.drop(['Sentiment'], axis=1)
__target_train =__train_dataset['Sentiment']
__feature_test = __test_dataset

# Text Vectorizer
In the next step, we will convert the pre-processed text columns to a vector representation. The vector representation allows us to train a model on numerical features.
We will use TfidfVectorizer; more detail can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
# PREPROCESSING-2
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sparse
from scipy.sparse import hstack, csr_matrix
_TEXT_COLUMNS = ['Phrase']
__temp_train_data = __feature_train[_TEXT_COLUMNS]
__feature_train.drop(_TEXT_COLUMNS, axis=1, inplace=True)
__feature_train_object_array = []
__temp_test_data = __feature_test[_TEXT_COLUMNS]
__feature_test.drop(_TEXT_COLUMNS, axis=1, inplace=True)
__feature_test_object_array = []
for _col in _TEXT_COLUMNS:
    __tfidfvectorizer = TfidfVectorizer(max_features=3000)
    vector_train = __tfidfvectorizer.fit_transform(__temp_train_data[_col])
    __feature_train_object_array.append(vector_train)
    vector_test = __tfidfvectorizer.transform(__temp_test_data[_col])
    __feature_test_object_array.append(vector_test)
__feature_train = sparse.hstack([__feature_train] + __feature_train_object_array).tocsr()
__feature_test = sparse.hstack([__feature_test] + __feature_test_object_array).tocsr()
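
As a small illustration of the fit-on-train, transform-on-test pattern used above, the sketch below runs TfidfVectorizer on a few made-up phrases. The toy phrases and variable names are assumptions for the example, and `max_features` is left at its default here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy phrases, not the competition data
train_phrases = ["good movie", "bad movie", "good good plot"]
test_phrases = ["bad plot twist"]

vectorizer = TfidfVectorizer()
# learn the vocabulary and IDF weights from the training phrases only
X_train = vectorizer.fit_transform(train_phrases)
# reuse the learned vocabulary; unseen words such as "twist" are ignored
X_test = vectorizer.transform(test_phrases)

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Because the test phrases are transformed with the vocabulary learned from the training phrases, both matrices have the same number of columns, which the model requires.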

# Training Model and Prediction
First, we will train a model based on preprocessed values of training data set.
Second, let's predict test values based on the trained model.

## Random Forest Classifier
We will use *RandomForestClassifier* which is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
More detail about *RandomForestClassifier* can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
# MODEL
import numpy as np
from sklearn.ensemble import RandomForestClassifier
__model = RandomForestClassifier()
__model.fit(__feature_train, __target_train)
__y_pred = __model.predict(__feature_test)
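
The role of `bootstrap` and `max_samples` described above can be sketched on synthetic data. The dataset, the parameter values, and the fixed `random_state` are assumptions chosen for a quick, reproducible example; the model in this notebook uses the default settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic 3-class problem, not the competition data
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# each of the 20 trees is fit on a bootstrap sample of half the rows
forest = RandomForestClassifier(n_estimators=20, bootstrap=True,
                                max_samples=0.5, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))
print(forest.predict(X[:5]))
```

With `bootstrap=False`, `max_samples` must stay at `None` and every tree is built on the whole dataset.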

# Submission File
We have to write the predicted target column to "kaggle_submission.csv", which will be submitted as our prediction results.

In [None]:
# SUBMISSION
submission = pd.DataFrame(columns=['PhraseId'], data=__test_dataset_submission_columns)
submission['Sentiment'] = __y_pred
submission.head()

In [None]:
# save submission file
submission.to_csv("kaggle_submission.csv", index=False)