# **ML-POWERED CONVERSION OF SUBJECTIVE TEXT TO OBJECTIVE TEXT**
Kieran Mendoza, Wang Hengyue

## Introduction
Journalists' biases and judgements often get in the way of the facts. This project aims first to detect whether a piece of text is subjective or objective, and then to rewrite it so that it is free of prejudice and more factual.

This project involves developing a model that detects whether the input text is objective, and then transforms subjective text into objective text and vice versa. We intend to use [Unsupervised Controllable Text Formalization](https://arxiv.org/abs/1809.04556) to achieve this.

**Domain: Media, Literature**

**Subject Area: Unsupervised Learning, Natural Language Processing**

## Objectives
* Detection of subjective or objective text
* Conversion of text from subjective to objective and vice versa

## Target Audience
* Fake news detection agencies - a highly subjective article is likely to contain some fallacious statements
* Media companies - a quick method of turning an otherwise dull chain of facts into a more engaging story
* English teachers - an aid for grading essays by determining whether the student is writing subjectively or objectively

## Target Outcome and Benefits
1. Catering to the preferences of different readers - some readers prefer the facts presented neatly and clearly, while others may want a more personal voice
2. Formation of an engaging story from a “skeleton” of collected facts, allowing for a “quick story”.






## Data Collection and Data Cleaning

In [None]:
!git clone https://github.com/Kimame04/nlp-text-subjectivity-conversion.git

fatal: destination path 'nlp-text-subjectivity-conversion' already exists and is not an empty directory.


In [None]:
#imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import time
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Annoyingly, many of the data files have custom file extensions, so they must first be renamed with their original extension (.txt) before any operations can be run on them.

In the end, I combined the reviews of each reviewer into a dataframe; these dataframes are then combined into a 'complete collection' dataset for further analysis.

Each dataframe has the following attributes:

- Review content
- Rating 
- ID 
- Three-class label
- Four-class label

In [None]:
path = '/content/nlp-text-subjectivity-conversion/data'

def readTxt(fileName):
  # the data files are encoded as windows-1252; return their contents as a list of lines
  with open(fileName,'r',encoding="windows-1252") as fileObj:
    return fileObj.read().splitlines()

all_series = []

for folder in os.listdir(path):
  subpath = os.path.join(path,folder)
  if os.path.isdir(subpath):
    temp = []
    for filename in os.listdir(subpath):
      if 'README' in filename:
        continue  # skip documentation files
      filepath = os.path.join(subpath,filename)
      if not filename.endswith('.txt'):
        # restore the .txt extension, then read the renamed file in the same pass
        # (previously a renamed file was only read on the next run of this cell)
        os.rename(filepath,filepath+'.txt')
        filename, filepath = filename+'.txt', filepath+'.txt'
      series = pd.Series(readTxt(filepath))
      series.name = filename
      temp.append(series)
    all_series.append(temp)


In [None]:
dataframes=[]
for series in all_series:
  temp = pd.concat(series,axis=1)
  dataframes.append(temp.sort_index(axis=1))
dataframes[1]

| | id.Scott+Renshaw.txt | label.3class.Scott+Renshaw.txt | label.4class.Scott+Renshaw.txt | rating.Scott+Renshaw.txt | subj.Scott+Renshaw.txt |
|---|---|---|---|---|---|
| 0 | 11961 | 0 | 0 | 0 | i'm guessing -- and from the available evidenc... |
| 1 | 13915 | 0 | 0 | 0 | there's bad buzz , and then there's the the ba... |
| 2 | 2790 | 0 | 0 | 0 | director : richard rush . director richard rus... |
| 3 | 3285 | 0 | 0 | 0 | screenplay : johnny brennan & kamal ahmed and ... |
| 4 | 10264 | 0 | 0 | 0.1 | screenplay : tim burns & tom stern and anthony... |
| ... | ... | ... | ... | ... | ... |
| 897 | 5754 | 2 | 3 | 1 | screenplay : nicholas kazan , robin swicord . ... |
| 898 | 6072 | 2 | 3 | 1 | screenplay : stanley tucci , joseph tropiano .... |
| 899 | 6371 | 2 | 3 | 1 | it is the kind of situation you might have see... |
| 900 | 6892 | 2 | 3 | 1 | billy bob thornton has presented me with a won... |
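
The per-reviewer column names carry the reviewer as a suffix (for example `id.Scott+Renshaw.txt`), so the frames cannot be stacked directly into the 'complete collection'. One possible way to do it is sketched below, assuming every column follows the `<attribute>.<reviewer>.txt` pattern seen in the output above. The helper names `split_column` and `combine_reviewers`, the short attribute names, the second reviewer, and the toy frames are all illustrative and not part of the notebook.

```python
import pandas as pd

# Assumed mapping from file-name prefix to a short attribute name (hypothetical names).
PREFIXES = {
    "id": "id",
    "label.3class": "label_3class",
    "label.4class": "label_4class",
    "rating": "rating",
    "subj": "review",
}

def split_column(col):
    """Split '<prefix>.<reviewer>.txt' into (attribute, reviewer)."""
    stem = col[:-len(".txt")] if col.endswith(".txt") else col
    # Try longer prefixes first so a long prefix is never shadowed by a shorter one.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix + "."):
            return PREFIXES[prefix], stem[len(prefix) + 1:]
    raise ValueError(f"unrecognised column name: {col}")

def combine_reviewers(frames):
    """Rename each frame's columns to shared attribute names and stack the frames."""
    renamed = []
    for frame in frames:
        pairs = [split_column(c) for c in frame.columns]
        out = frame.copy()
        out.columns = [attr for attr, _ in pairs]
        out["reviewer"] = pairs[0][1]  # assumes all columns of one frame share a reviewer
        renamed.append(out)
    return pd.concat(renamed, ignore_index=True)

# Toy frames standing in for two entries of `dataframes` (values are made up).
a = pd.DataFrame({"id.Scott+Renshaw.txt": ["11961"],
                  "rating.Scott+Renshaw.txt": ["0"],
                  "subj.Scott+Renshaw.txt": ["i'm guessing ..."]})
b = pd.DataFrame({"id.Reviewer+B.txt": ["42"],
                  "rating.Reviewer+B.txt": ["0.5"],
                  "subj.Reviewer+B.txt": ["a fine film ..."]})
complete = combine_reviewers([a, b])
```

Keeping the reviewer as its own column means the stacked frame can still be grouped or filtered per reviewer later.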


In [None]:
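import re

df = dataframes[0].iloc[:,1:].copy()
punctuation_signs = list("?.:;,!")
stop_words = list(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()  # created here; lemmatization is not applied in this cell

# a DataFrame has no .str accessor, so clean each text column (a Series) in turn
for col in df.columns:
    for punct_sign in punctuation_signs:
        # regex=False: '?' and '.' are regex metacharacters and must be matched literally
        df[col] = df[col].str.replace(punct_sign, '', regex=False)

    df[col] = df[col].str.replace("'s", "", regex=False)

    for stop_word in stop_words:
        # \b anchors restrict the match to whole words
        regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
        df[col] = df[col].str.replace(regex_stopword, '', regex=True)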

| | rt-polarity.neg.txt | rt-polarity.pos.txt |
|---|---|---|
| 0 | simplistic , silly and tedious . | the rock is destined to be the 21st century's ... |
| 1 | it's so laddish and juvenile , only teenage bo... | the gorgeously elaborate continuation of " the... |
| 2 | exploitative and largely devoid of the depth o... | effective but too-tepid biopic |
| 3 | [garbus] discards the potential for pathologic... | if you sometimes like to go to the movies to h... |
| 4 | a visually flashy but narratively opaque and e... | emerges as something rare , an issue movie tha... |
| ... | ... | ... |
| 5326 | a terrible movie that some people will neverth... | both exuberantly romantic and serenely melanch... |
| 5327 | there are many definitions of 'time waster' bu... | mazel tov to a film about a family's joyous li... |
| 5328 | as it stands , crocodile hunter has the hurrie... | standing in the shadows of motown is the best ... |
| 5329 | the thing looks like a made-for-home-video qui... | it's nice to see piscopo again after all these... |


## Detection of Subjectivity in Text

We train a supervised learning algorithm in two ways: on sentiment (a continuous value) and on polarity (a discrete label).

In [None]:
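polarityDf = dataframes[0].copy().iloc[:,1:]
# stack the negative column on top of the positive column
polarityDf = pd.DataFrame(pd.concat((polarityDf.iloc[:,0],polarityDf.iloc[:,1]),axis=0)).reset_index(drop=True)
polarityDf.columns = ['desc']
# labels follow the displayed output ('neg' / 'pos'); rows from index 5331 onwards are the positive sentences
polarityDf['subjectivity'] = 'neg'
# .loc avoids chained assignment, which can silently write to a copy
polarityDf.loc[5331:, 'subjectivity'] = 'pos'
polarityDf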

| | desc | subjectivity |
|---|---|---|
| 0 | simplistic , silly and tedious . | neg |
| 1 | it's so laddish and juvenile , only teenage bo... | neg |
| 2 | exploitative and largely devoid of the depth o... | neg |
| 3 | [garbus] discards the potential for pathologic... | neg |
| 4 | a visually flashy but narratively opaque and e... | neg |
| ... | ... | ... |
| 10657 | both exuberantly romantic and serenely melanch... | pos |
| 10658 | mazel tov to a film about a family's joyous li... | pos |
| 10659 | standing in the shadows of motown is the best ... | pos |
| 10660 | it's nice to see piscopo again after all these... | pos |
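
A frame with one text column and one discrete label column is enough to train a first classifier. The sketch below shows one possible setup using the `TfidfVectorizer` and `train_test_split` already imported above. The choice of `LogisticRegression` is an assumption (the notebook does not name a model), and the toy sentences are made up to stand in for `polarityDf`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for polarityDf: invented sentences, same column names as above.
toy = pd.DataFrame({
    "desc": [
        "simplistic , silly and tedious",
        "a dull and lifeless mess",
        "tedious plot and flat acting",
        "a silly , boring waste of time",
        "a charming and moving story",
        "effective and wonderfully acted",
        "a joyous , moving film",
        "charming , funny and wonderfully warm",
    ],
    "subjectivity": ["neg"] * 4 + ["pos"] * 4,
})

# Hold out a stratified quarter of the rows so both labels appear in the test split.
X_train, X_test, y_train, y_test = train_test_split(
    toy["desc"], toy["subjectivity"],
    test_size=0.25, random_state=0, stratify=toy["subjectivity"],
)

# TF-IDF turns each sentence into a sparse weighted bag-of-words vector;
# logistic regression (an assumed model choice) then learns a linear decision boundary.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # fraction of held-out rows labelled correctly
```

With the real data, `toy` would be replaced by `polarityDf`; an accuracy measured on eight invented sentences says nothing about performance on the actual reviews.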


## NLP 

## Resources

**Papers**

https://ojs.aaai.org/index.php/AAAI/article/view/6433

https://arxiv.org/abs/1809.04556

**Basis of our text formalisation**

https://github.com/parajain/uctf

**Dataset** 

http://www.cs.cornell.edu/people/pabo/movie-review-data/