
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Corpus
A corpus (plural: corpora) is a collection of text documents. A custom corpus is typically just a set of text files in a directory, often alongside other directories of text files.

To train your own vectors, first prepare the corpus as a single text file in which all words are separated by one or more spaces or tabs. If the corpus contains multiple documents, separate the documents, and only the documents, with newline characters.
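
As a minimal sketch of that layout, assuming a toy list of already tokenized documents (`toy_docs` is made-up data, not part of the dataset used below), the corpus text could be assembled like this:

```python
# Toy, already tokenized documents (hypothetical data for illustration only).
toy_docs = [
    ["the", "movie", "was", "great"],
    ["terrible", "plot", "and", "acting"],
]

# Words within a document are separated by single spaces;
# documents are separated by newline characters.
corpus_text = "\n".join(" ".join(doc) for doc in toy_docs)
print(corpus_text)
```

Each line of the printed text is one document, and splitting a line on whitespace recovers its words.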

## Import Modules

In [None]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## Load Dataset

In [None]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/IMDB.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
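
The preview shows that some reviews contain HTML line breaks such as `<br />`. A tokenizer that keeps runs of word characters would let those tags through as `br` tokens. One possible cleanup step, sketched here with the standard `re` module, is to replace tags with a space before tokenizing. This is an optional addition and not something the notebook itself does; `strip_html_tags` is a hypothetical helper name and the sample string is made up:

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
# This is a simple heuristic, not a full HTML parser.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html_tags(text):
    """Replace HTML-like tags with a space and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = "Great film.<br /><br />Would watch again."
print(strip_html_tags(sample))  # Great film. Would watch again.
```

If this step is wanted, it could be applied to the whole column with `df['review'].apply(strip_html_tags)` before the documents array is created.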


## Create Document array using text field

In [None]:
docs_array = df['review']
print("Dimension of the documents array: ", docs_array.shape)

Dimension of the documents array:  (50000,)


## Tokenize the words

In [None]:
# Convert a collection of raw text documents into a list of lists of tokenized words
def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')  # Tokens are runs of word characters.

    # Convert to lowercase and split into words. Building a new list
    # leaves the input (for example df['review']) unmodified.
    docs = [tokenizer.tokenize(doc.lower()) for doc in docs]

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove words that are only one character.
    docs = [[token for token in doc if len(token) > 1] for doc in docs]

    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs
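
To see what the tokenizing and filtering steps do without downloading any NLTK data, here is a rough standard-library sketch. It assumes that `re.findall(r"\w+", ...)` is an acceptable stand-in for `RegexpTokenizer(r'\w+')`, it skips lemmatization entirely, and `simple_preprocess` is a hypothetical name:

```python
import re

def simple_preprocess(texts):
    """Lowercase, tokenize on word characters, drop digit-only and one-character tokens."""
    processed = []
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        tokens = [t for t in tokens if not t.isdigit()]  # drop pure numbers, keep e.g. "mp3"
        tokens = [t for t in tokens if len(t) > 1]       # drop single characters
        processed.append(tokens)
    return processed

print(simple_preprocess(["I own 2 mp3 players"]))  # [['own', 'mp3', 'players']]
```

The token `2` is removed because it is purely numeric, `mp3` is kept because it only contains a digit, and `i` is removed because it is a single character.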

In [None]:
# Convert a list of sentences to a list of lists containing tokenized words
%time docs = docs_preprocessor(docs_array)

CPU times: user 1min 24s, sys: 562 ms, total: 1min 24s
Wall time: 1min 24s


In [None]:
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

Length of the 2D Array of Tokenized Documents:  50000


In [None]:
# View tokenized list of words
docs[0][0:6]

['one', 'of', 'the', 'other', 'reviewer', 'ha']

## Option 1: Create corpus with words only

In [None]:
# Combine list of lists into a single list
from itertools import chain
combined_docs = list(chain.from_iterable(docs))

In [None]:
# Separate words by whitespace
combined_docs_sep = ' '.join(combined_docs)

In [None]:
# Check if merged
combined_docs_sep[0:100]

'one of the other reviewer ha mentioned that after watching just oz episode you ll be hooked they are'

In [None]:
# Write the space-separated words to a file; the with block closes the file afterwards
with open("word_corpus.txt", "w") as fo:
    fo.write(combined_docs_sep)

## Option 2: Create corpus with a single document in each line

In [None]:
with open('docs_corpus.txt', 'w') as f:
    # Loop through each document, join its words with spaces, then write it as one line
    for doc in docs:
        doc_sep = ' '.join(doc)
        f.write("%s\n" % doc_sep)
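
To check the result, the file can be read back with one document per line. The sketch below uses toy data and a temporary directory so that it does not touch `docs_corpus.txt`; `toy_docs` and the file name `toy_corpus.txt` are placeholders:

```python
import os
import tempfile

# Toy tokenized documents (hypothetical data for illustration only).
toy_docs = [["one", "of", "the", "other"], ["wonderful", "little", "production"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_corpus.txt")

    # Write one space-separated document per line.
    with open(path, "w", encoding="utf-8") as f:
        for doc in toy_docs:
            f.write(" ".join(doc) + "\n")

    # Read the file back: each line is a document, each whitespace-separated item a token.
    with open(path, encoding="utf-8") as f:
        loaded_docs = [line.split() for line in f]

print(loaded_docs == toy_docs)  # True
```

The same reading loop should work on `docs_corpus.txt`, where the number of lines is expected to equal `len(docs)`.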