# <font color = 'pickle'>**Pre-processing/Feature Engineering**

# <font color ='pickle'> **Task1: Download data and combine data from multiple files into a single dataframe**

In this task we download the movie reviews from the following link:
https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz


- The data has movie reviews from four different reviewers: 
(1) Dennis+Schwartz,  (2) James+Berardinelli, (3) Scott+Renshaw and (4) Steve+Rhodes.

- We have to extract the reviews of the four reviewers into a single dataframe.
- The final dataframe will have two columns: (1) `Moview Review` and (2) `Reviewer Name`.


# <font color = 'dodgerblue'>**Import libraries**

In [None]:
# import libraries
import pandas as pd
from pathlib import Path
import zipfile
import tarfile
import os

# <font color = 'dodgerblue'>**Mount Google drive and Specify folder paths**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Make sure you change the path to where you want to save data
# In the code below, 6342_NLP/Lecture2 is the folder in my Google Drive
# You can change this to the appropriate folder for your drive
# For example, you may want to save data to BUAN6341/HW1/Data
# In this case the code below should be modified to: '/content/drive/MyDrive/BUAN6341/HW1/Data'

base_path = r'/content/drive/MyDrive/6342_NLP/Lecture2/'

In [None]:
# create a POSIX path for data folder
# we can use this to navigate file system
base_folder = Path(base_path)

In [None]:
# I usually keep the compressed files in the archive folder and unzip these files in the data folder
# You can skip this step if you do not want to follow this folder structure

# The / operator can join several paths, or a mix of paths and strings, as long as at least
# one of the operands is an instance of class `Path` from the `pathlib` library (as shown below).

archive_folder = base_folder/'archive'
data_folder = base_folder/'datasets'

In [None]:
# check current working directory
Path.cwd()

PosixPath('/content')

## <font color = 'dodgerblue'>**Step1: use wget to download data files from URL**

Download a file to the filesystem from a URL using the wget command <br>
URL = https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz<br>

**Syntax** <br> !wget {url} -P {path_to_save_folder} <br> !wget {url} -O {path_to_save_file} <br>
- To use Python variables in shell commands, we have to wrap them in {} brackets
- -P sets the folder to download into; if we do not specify -P, the file is saved in the current directory
- -O sets the exact output file, and an existing file at that path is overwritten. When -O is given, -P is not applied, so -O should contain the full path
<br>

Alternatively we can also write the values out directly <br>
!wget url -P 'path_to_save_folder'
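
If `wget` is not available, a similar download can be done from Python itself. The sketch below uses `urllib.request.urlretrieve` from the standard library; the helper name `download_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_file(url, destination):
    """Download `url` to `destination`, overwriting any existing file."""
    destination = Path(destination)
    # Create the target folder if it does not exist yet
    destination.parent.mkdir(parents=True, exist_ok=True)
    # urlretrieve writes the response body to the given filename
    urlretrieve(url, str(destination))
    return destination

# Example call (commented out so the sketch does not hit the network):
# download_file(URL, archive_folder / 'reviews.tar.gz')
```

Unlike the `wget` command, this version prints no progress log, so it suits small files best.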

In [None]:
# use wget to download the data
file = archive_folder/'reviews.tar.gz'
URL = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz'

# CODE HERE

!wget {URL} -P {archive_folder} -O {file}

--2022-09-11 02:17:05--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/6342_NLP/Lecture2/archive/reviews.tar.gz’


2022-09-11 02:17:05 (25.0 MB/s) - ‘/content/drive/MyDrive/6342_NLP/Lecture2/archive/reviews.tar.gz’ saved [8853204/8853204]



## <font color = 'dodgerblue'>**Step2: check content of folder where data was downloaded**

In [None]:
# step 2: check the content of the folder where the file was downloaded
# archive_folder is a pathlib Path, so we can iterate over it using .iterdir()
# Here we print only the entries whose name contains 'tar'
# Code here

for entries in archive_folder.iterdir():
  if 'tar' in entries.name:
    print(entries.name)

review_polarity.tar.gz
reviews.tar.gz


## <font color = 'dodgerblue'>**Step3: Check content of zipped/tar folder**

We can construct a path to the file by joining the parts using the special operator /. The / operator can join several paths, or a mix of paths and strings, as long as at least one of the operands is an instance of class `Path` from the `pathlib` library (as shown below).
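
As a quick illustration of the `/` operator, the sketch below joins a path object with plain strings. The folder names are made up for the example, and `PurePosixPath` is used instead of `Path` only so that the printed result looks the same on any operating system.

```python
from pathlib import PurePosixPath

# Hypothetical base folder, used only for illustration
base = PurePosixPath('/content/drive/MyDrive/example')

# path / str returns a new path object
archive = base / 'archive'

# str / path also works, as long as one operand is a path object
nested = 'prefix' / PurePosixPath('data') / 'reviews.tar.gz'

print(archive)       # /content/drive/MyDrive/example/archive
print(nested)        # prefix/data/reviews.tar.gz
print(archive.name)  # archive
```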


In [None]:
# path for the tar archive
file = archive_folder / 'reviews.tar.gz'

We will open and read the tar archive using 
```python
with tarfile.open(file, mode)
``` 
Here `file` is the archive to open and `mode` specifies how to open it (read, write, etc.).
In the command below we use `'r'` to open the archive in read mode; with `'r'`, `tarfile` detects the gzip compression automatically.
Finally, we use the `getnames()` method to list the contents of the archive.
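
The same open-then-list pattern, applied to a `.tar.gz` archive with the `tarfile` module, is sketched below. It builds a tiny archive in a temporary folder so that it runs on its own; the file and folder names are invented for the example.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Create a small file to archive (hypothetical content)
    sample = tmp / 'review.txt'
    sample.write_text('A short review.')

    # Write a gzip-compressed tar archive containing that file
    archive = tmp / 'sample.tar.gz'
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(sample, arcname='reviews/review.txt')

    # Mode 'r' lets tarfile detect the compression automatically
    with tarfile.open(archive, 'r') as tar:
        names = tar.getnames()

print(names)  # ['reviews/review.txt']
```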

In [None]:
# list the content of the tar archive
# code here

with tarfile.open(file, 'r') as tar:
  tar_file_names = tar.getnames()

## <font color = 'dodgerblue'>**Step 4: unzip/untar files**

In [None]:
# %%timeit will tell us the time it takes to execute this cell
# %%timeit 
# extract the archive into data_folder

with tarfile.open(file, 'r') as tar:
    tar.extractall(path = data_folder)

In [None]:
tar_file_names[0:10]

['scaledata.README.1.0.txt',
 'scale_whole_review/Dennis+Schwartz',
 'scale_whole_review/Dennis+Schwartz/txt.parag',
 'scale_whole_review/Dennis+Schwartz/txt.parag/16748.txt',
 'scale_whole_review/Dennis+Schwartz/txt.parag/17108.txt',
 'scale_whole_review/Dennis+Schwartz/txt.parag/17109.txt',
 'scale_whole_review/Dennis+Schwartz/txt.parag/17110.txt',
 'scale_whole_review/Dennis+Schwartz/txt.parag/17111.txt',
 'scale_whole_review/Dennis+Schwartz/txt.parag/17116.txt',
 'scale_whole_review/Dennis+Schwartz/txt.parag/17117.txt']

## <font color = 'dodgerblue'>**Step 5: Create DataFrame**

In [None]:
extracted_folder = data_folder/'scale_whole_review'

# Read one review file and return its text (decoded as ISO-8859-1)
def read_review(path):
  with open(path, 'r', encoding='ISO-8859-1') as f:
    return f.read()

# Read every review in a folder into a list, one review per element
def combine_reviews(path):
  return [read_review(file) for file in path.iterdir()]

# Build one dataframe per reviewer folder, then concatenate them.
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead.
frames = []
for folder in extracted_folder.iterdir():
  # iterdir() already yields full paths, so we join only the subfolder name
  if not folder.is_dir():
    continue
  reviews = pd.DataFrame(combine_reviews(folder/'txt.parag'), columns=['Moview Review'])
  reviews['Reviewer Name'] = folder.name
  frames.append(reviews)

df = pd.concat(frames, ignore_index=True)


In [None]:
df.head()

| | Moview Review | Reviewer Name |
|---|---|---|
| 0 | Let's start by clearing up one point. First-ti... | Steve+Rhodes |
| 1 | The ending credits do not confirm it, but it w... | Steve+Rhodes |
| 2 | Three years ago we started out our son Jeffrey... | Steve+Rhodes |
| 3 | HEAD OVER HEELS, a romantic comedy twist on RE... | Steve+Rhodes |
| 4 | In DOWN TO EARTH, Chris Rock plays a strugglin... | Steve+Rhodes |


In [None]:
df.tail()

| | Moview Review | Reviewer Name |
|---|---|---|
| 5001 | RUMBLE FISH (director/writer: Francis Ford Cop... | Dennis+Schwartz |
| 5002 | EQUINOX (director/writer: Alan Rudolph; cinema... | Dennis+Schwartz |
| 5003 | MAHLER (director/writer: Ken Russell; cinemato... | Dennis+Schwartz |
| 5004 | SUITE 16 (director: Dominique Deruddere; scree... | Dennis+Schwartz |
| 5005 | PLEASANTVILLE (director/writer:Gary Ross; cine... | Dennis+Schwartz |
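
The folder walk in Step 5 can also be written as a flat list of records, which keeps the file handling separate from pandas. This is a sketch under the assumption that each reviewer folder contains a `txt.parag` subfolder of `.txt` files, as in the archive listing above; `collect_reviews` is a hypothetical helper name.

```python
from pathlib import Path

def collect_reviews(root, subfolder='txt.parag', encoding='ISO-8859-1'):
    """Return one dict per review file found under root/<reviewer>/<subfolder>."""
    records = []
    for reviewer_dir in sorted(Path(root).iterdir()):
        review_dir = reviewer_dir / subfolder
        if not review_dir.is_dir():
            continue  # skip stray files and folders without reviews
        for review_file in sorted(review_dir.glob('*.txt')):
            records.append({
                'Moview Review': review_file.read_text(encoding=encoding),
                'Reviewer Name': reviewer_dir.name,
            })
    return records

# With pandas available, pd.DataFrame(collect_reviews(extracted_folder))
# would give a dataframe with the same two columns.
```

Sorting the folders and files makes the row order reproducible, whereas `iterdir()` alone returns entries in arbitrary order.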


# <font color ='pickle'> **Task2: Feature Engineering and Preprocessing IMDB - 8 points**


<font color = 'dodgerblue'>**You can use regular expression or spacy for this task** </font>

- <font color = 'dodgerblue'> **PreProcessing**:</font>
  1. Remove HTML tags and newline characters (\n)
  2. Remove emails, URLs and punctuation

  For preprocessing, write your own simple functions. Your final cleaned text should be saved in a new column, `cleaned_text`.


- <font color = 'dodgerblue'> **Feature Engineering**

  Use the `cleaned_text` column you created in the previous step and extract the following features as new columns.

  1. number of words
  2. number of characters
  3. number of characters without space
  4. average word length
  5. count of numbers (37, 201, 20, etc.)
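
To make the five features concrete, the sketch below computes them for a single string with plain Python. The function name `text_features` and the sample sentence are made up for illustration, and returning `0.0` for empty text is a choice made here, not a requirement of the task.

```python
import re

def text_features(text):
    """Compute the five features for one cleaned string."""
    words = text.split()
    n_words = len(words)
    # Guard against empty text to avoid dividing by zero
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        'number of words': n_words,
        'number of characters': len(text),
        'number of characters without space': len(text.replace(' ', '')),
        'average word length': avg_word_len,
        # Standalone runs of digits such as 37 or 201 ("30s" is not counted)
        'count of numbers': len(re.findall(r'\b\d+\b', text)),
    }

features = text_features('I saw 2 films in 1934')
print(features['number of words'])   # 6
print(features['count of numbers'])  # 2
```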

You will use the IMDB movie review dataset. The details of the data can be found at this link: https://ai.stanford.edu/~amaas/data/sentiment/.

Description of the data from the above link :
"*This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.*".

We extracted the data from the text files and saved the train and test data as csv files. We further cleaned the dataset. <font size =4, color = 'dodgerblue'>**We will use the train.csv file for this task.**</font>
The file is available in the Lecture2 data subfolder.

## <font color = 'dodgerblue'>**Step 1: Install Spacy, Beautiful Soup and Swifter**

In [None]:
!python -m spacy download en_core_web_sm
!pip install swifter
import spacy
import swifter

2022-09-11 02:30:07.300873: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting swifter
  Downloading swifter-1.3.4.tar.gz (830 kB)
Collecting psutil>=5.6.6
  Downloading psutil-5.9.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 kB)

In [None]:
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
print(nlp.pipe_names)

['ner']


In [None]:
from bs4 import BeautifulSoup
import re

## <font color = 'dodgerblue'>**Step 2: Read Train.csv into a Dataframe**

In [44]:
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
!wget {url} -P {archive_folder}

with tarfile.open(archive_folder/'aclImdb_v1.tar.gz', 'r') as tar:
 tar.extractall(data_folder/'aclImdb_v1')

--2022-09-11 03:11:10--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/6342_NLP/Lecture2/archive/aclImdb_v1.tar.gz.6’


2022-09-11 03:11:16 (13.5 MB/s) - ‘/content/drive/MyDrive/6342_NLP/Lecture2/archive/aclImdb_v1.tar.gz.6’ saved [84125825/84125825]



In [None]:
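# Load the IMDB training data; the cells below refer to this dataframe as df
# (this replaces the Task 1 dataframe of the same name)
df = pd.read_csv(base_folder/'Data'/'train.csv')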

## <font color = 'dodgerblue'>**Step 3: Remove HTML Text**

In [None]:
def RemoveHTML(text):
  return BeautifulSoup(text , "html.parser").get_text()

df['cleaned_text'] = df['Reviews'].swifter.apply(RemoveHTML)


## <font color = 'dodgerblue'>**Step 4: Remove New Line Text**

In [None]:
def RemoveNewLine(text):
  return re.sub(r'[\n]', '', text)

df['cleaned_text'] = df['cleaned_text'].swifter.apply(RemoveNewLine)

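
Since the task allows regular expressions, here is a regex-based sketch of the HTML and newline steps. It is a rough alternative rather than the method used in this notebook: a pattern like `<[^>]+>` is only adequate for simple markup such as `<br />`, whereas BeautifulSoup parses the HTML properly. The function name is hypothetical.

```python
import html
import re

def strip_html_and_newlines(text):
    # Replace tags with a space so neighbouring words do not merge
    text = re.sub(r'<[^>]+>', ' ', text)
    # Decode entities such as &amp;
    text = html.unescape(text)
    # Collapse newlines and repeated whitespace into single spaces
    return re.sub(r'\s+', ' ', text).strip()

sample = 'funny scenes.<br /><br />The movie\nwas &amp; is great'
print(strip_html_and_newlines(sample))
# funny scenes. The movie was & is great
```

Replacing tags and newlines with a space, instead of an empty string, keeps words on either side from being glued together.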

## <font color = 'dodgerblue'>**Step 5: Remove Punctuation, URL and Emails**

In [None]:
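def RemoveOtherStuff(text):
  # is_punct, like_url and like_email are lexical attributes set by the tokenizer,
  # so they work even with the other pipeline components disabled
  # Pass 1: drop punctuation tokens
  doc = nlp(text)
  t = ' '.join([token.text for token in doc if not token.is_punct])
  # Pass 2: drop tokens that look like URLs
  doc = nlp(t)
  t = ' '.join([token.text for token in doc if not token.like_url])
  # Pass 3: drop tokens that look like email addresses
  doc = nlp(t)
  t = ' '.join([token.text for token in doc if not token.like_email])
  return t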

df['cleaned_text'] = df['cleaned_text'].swifter.apply(RemoveOtherStuff)

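
A regex-only version of this step might look like the sketch below. The patterns are deliberately simple assumptions and will not match every valid URL or email address; spaCy's `like_url` and `like_email` use their own rules, so results can differ. The function and constant names are hypothetical.

```python
import re
import string

# Simplified patterns, assumed for illustration only
URL_RE = re.compile(r'(?:https?://|www\.)\S+')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_urls_emails_punct(text):
    # Remove URLs and emails first, while their punctuation is still intact
    text = URL_RE.sub(' ', text)
    text = EMAIL_RE.sub(' ', text)
    # Delete every ASCII punctuation character
    text = text.translate(PUNCT_TABLE)
    # Tidy up the whitespace left behind
    return ' '.join(text.split())

sample_text = 'Great film! See https://example.com or mail me@example.com.'
print(remove_urls_emails_punct(sample_text))
# Great film See or mail
```

The order matters: stripping punctuation first would remove the `://`, `.` and `@` characters that the URL and email patterns rely on.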

In [None]:
df['cleaned_text'][0]

"Ever wanted to know just how much Hollywood could get away with before the Hayes Code was officially put into effect Well unfortunately Convention City is lost so well just have to watch Tarzan and His Mate to find out For 1934 there is a remarkable amount of sexual innuendo and even exposed flesh Just look at Jane 's nude swim While Tarzan is often thought of as b adventure films made for young boys and no one else this picture proves that the series was originally very adult Over seventy years later it is still as sexy as it was when it came out In addition to the envelope pushing taboo nature it is a superb and exciting adventure story I ' ve always enjoyed the jungle films that Hollywood churned out in the 30s and the 40s but there are few from the genre I 'd call great films Tarzan and His Mate is by far the best film from this long gone subgenre The sequences of the attacks on the safari by either apes or natives still manage to create tension today Also the animals are all too 

## <font color = 'dodgerblue'>**Step 6: Count Number of Words**

In [None]:
def countNumWords(text):
  return len(text.split())

df['number of words'] = df['cleaned_text'].swifter.apply(countNumWords)


## <font color = 'dodgerblue'>**Step 7: Count Number of Characters**

In [None]:
def countNumCharacters(text):
  return len(text)

df['number of characters'] = df['cleaned_text'].swifter.apply(countNumCharacters)


## <font color = 'dodgerblue'>**Step 8: Count Number of Characters Without Space**

In [None]:
def countNumCharactersWoSpace(text):
  return len(re.sub(' ', '', text))

df['number of characters without space'] = df['cleaned_text'].swifter.apply(countNumCharactersWoSpace)


## <font color = 'dodgerblue'>**Step 9: Count of Numbers**

In [28]:
def countNums(text):
  doc = nlp(text)
  return len([token.text for token in doc if token.is_digit])

df['count of number'] = df['cleaned_text'].swifter.apply(countNums)


## <font color = 'dodgerblue'>**Step 10: Get Average Word Length**

In [45]:
import numpy as np

# Average word length = mean number of characters per whitespace-separated word
# Note: np.mean returns nan (with a warning) for a review that contains no words
df['average Word Length'] = df['cleaned_text'].apply(lambda x: np.mean([len(w) for w in x.split()]))

In [46]:
# Drop any column whose name starts with 'Unnamed' (the index saved in the csv) or 'Labels'
dfout = df.loc[:, ~df.columns.str.contains('^Unnamed')]
dfout = dfout.loc[:, ~dfout.columns.str.contains('^Labels')]

In [47]:
dfout

| | Reviews | cleaned_text | number of words | number of characters | number of characters without space | count of number | average Word Length |
|---|---|---|---|---|---|---|---|
| 0 | Ever wanted to know just how much Hollywood co... | Ever wanted to know just how much Hollywood co... | 243 | 1292 | 1050 | 1 | 4.320988 |
| 1 | The movie itself was ok for the kids. But I go... | The movie itself was ok for the kids But I got... | 59 | 283 | 225 | 0 | 3.813559 |
| 2 | You could stage a version of Charles Dickens' ... | You could stage a version of Charles Dickens A... | 850 | 4674 | 3825 | 1 | 4.500000 |
| 3 | this was a fantastic episode. i saw a clip fro... | this was a fantastic episode i saw a clip from... | 131 | 636 | 506 | 0 | 3.862595 |
| 4 | and laugh out loud funny in many scenes.<br />... | and laugh out loud funny in many scenes The mo... | 105 | 545 | 441 | 0 | 4.200000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | I really felt cheated after seeing this pictur... | I really felt cheated after seeing this pictur... | 287 | 1446 | 1160 | 5 | 4.041812 |
| 24996 | Ok,so.....guy gets bitten by a bat and then tu... | Ok so guy gets bitten by a bat and then turns ... | 44 | 208 | 165 | 0 | 3.750000 |
| 24997 | I am generally more willing to be open minded ... | I am generally more willing to be open minded ... | 358 | 1875 | 1518 | 1 | 4.240223 |
| 24998 | What a load of rubbish.. I can't even begin to... | What a load of rubbish I ca n't even begin to ... | 273 | 1430 | 1158 | 3 | 4.241758 |
