# Data & Data Reader 
**Height Yan**<br>
**Sai Ramesh**<br>
**Ben Robbins**<br><br>
In this submission we describe the dimensions of our data and provide a Data Reader that preprocesses the raw data into a format suitable for machine learning models. The submission comprises the following sections:<br><br>
*[1. Data Overview](#1.-Data-Overview)*<br>
&emsp;&emsp;*[1.1. Data Preprocess](#1.1.-Data-Preprocess)*<br>
&emsp;&emsp;*[1.2. Export Data](#1.2.-Export-Data)*<br>
*[2. Data Reader](#2.-Data-Reader)*<br>

## 1. Data Overview
In this section, we will summarize and visualize the dimensions of our corpus.

In [1]:
import pandas as pd
import numpy as np
import re
import os
from nltk.tokenize import sent_tokenize

## 1.1. Data Preprocess
First, we preprocess the data so that the summary statistics are more accurate. This includes extracting sentences from the .srt files and removing invalid elements such as HTML tags and stray Unicode symbols.

In [2]:
def data_preprocess():
    corpus_raw_dir = "./corpus_raw/"
    movie_df = pd.DataFrame()
    for label in os.listdir(corpus_raw_dir):
        # Allocate the movie & trailer subtitle folders
        movie_dir = f"{corpus_raw_dir}{label}/movie_corpus/"
        trailer_dir = f"{corpus_raw_dir}{label}/trailer_corpus/"
        
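        # Fetch data. os.listdir returns names in arbitrary order, so sort both
        # listings to pair each movie with its trailer by position (this assumes
        # the two folders use matching file names).
        movie_files = sorted(os.listdir(movie_dir))
        trailer_files = sorted(os.listdir(trailer_dir))
        cur_movie_df = pd.DataFrame()
        cur_movie_df["label"] = [label] * len(movie_files)
        cur_movie_df["movie"] = [re.sub(r"\.(?:srt|txt)$", "", filename) for filename in movie_files]
        cur_movie_df["corpus_movie"] = [sentence_parition(f"{movie_dir}{filename}") for filename in movie_files]
        cur_movie_df["corpus_trailer"] = [sentence_parition(f"{trailer_dir}{filename}") for filename in trailer_files]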
        cur_movie_df["num_sentence_movie"] = cur_movie_df["corpus_movie"].apply(lambda row: len(row))
        cur_movie_df["num_sentence_trailer"] = cur_movie_df["corpus_trailer"].apply(lambda row: len(row))
        cur_movie_df["num_words_movie"] = cur_movie_df["corpus_movie"].apply(lambda row: len(' '.join(row).split()))
        cur_movie_df["num_words_trailer"] = cur_movie_df["corpus_trailer"].apply(lambda row: len(' '.join(row).split()))
        cur_movie_df["num_letters_movie"] = cur_movie_df["corpus_movie"].apply(lambda row: len(' '.join(row)))
        cur_movie_df["num_letters_trailer"] = cur_movie_df["corpus_trailer"].apply(lambda row: len(' '.join(row)))
        movie_df = pd.concat([movie_df, cur_movie_df])
        
        # Display overview
        print(f"{label}:")
        print(f"    {'Movie subtitle file count:':<40} {len(os.listdir(movie_dir))}")
        print(f"    {'Trailer subtitle file count:':<40} {len(os.listdir(trailer_dir))}")
        print(f"    {'Total movie sentences:':<40} {cur_movie_df['num_sentence_movie'].sum()}")
        print(f"    {'Total trailer sentences:':<40} {cur_movie_df['num_sentence_trailer'].sum()}")
        print(f"    {'Total movie words:':<40} {cur_movie_df['num_words_movie'].sum()}")
        print(f"    {'Total trailer words:':<40} {cur_movie_df['num_words_trailer'].sum()}")
        print(f"    {'Total movie letters:':<40} {cur_movie_df['num_letters_movie'].sum()}")
        print(f"    {'Total trailer letters:':<40} {cur_movie_df['num_letters_trailer'].sum()}")
        
    display(movie_df.head())
    display(movie_df.tail(), movie_df.shape)
    return movie_df
       
        

        
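def sentence_parition(filepath):
    """Read a subtitle file and return its list of cleaned sentences."""
    # Latin-1 maps every byte to a character, so decoding never raises an error
    with open(filepath, encoding="latin") as file:
        context = file.read()
    if filepath.endswith(".srt"):
        # Capture the text block that follows each "start --> end" timestamp line
        pattern = r".*? --> .*?\n(.*?)\n\n"
    else:
        # For other files, treat every line as one sentence
        pattern = r"(.*?)\n"
    # Remove subtitle tags (if any): [...], (...), "SPEAKER: " prefixes and <...> tags
    context = re.sub(r"\[.*?\]|\(.*?\)|.*?: |<.*?>", "", context)
    sentences = re.findall(pattern, context, re.S)
    sentences = [sentence.replace("\n", " ") for sentence in sentences]  # There should be no newlines in a sentence
    sentences = [sentence for sentence in sentences if sentence]  # Drop empty strings
    return sentences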
              
dataframe = data_preprocess()

DC:
    Movie subtitle file count:               19
    Trailer subtitle file count:             19
    Total movie sentences:                   27179
    Total trailer sentences:                 538
    Total movie words:                       166038
    Total trailer words:                     3249
    Total movie letters:                     890994
    Total trailer letters:                   16746
Marvel:
    Movie subtitle file count:               30
    Trailer subtitle file count:             30
    Total movie sentences:                   46713
    Total trailer sentences:                 799
    Total movie words:                       271882
    Total trailer words:                     4678
    Total movie letters:                     1442127
    Total trailer letters:                   23788


Unnamed: 0,label,movie,corpus_movie,corpus_trailer,num_sentence_movie,num_sentence_trailer,num_words_movie,num_words_trailer,num_letters_movie,num_letters_trailer
0,DC,Aquaman,"[Jules Verne once wrote,, ""Put two ships in th...","[my parents were from different worlds, and no...",1133,51,7136,279,38086,1442
1,DC,Batman Begins,[- Can I see? - Finders keepers. And I found i...,"[Tell us, Mr. Wayne., What do you fear?, How d...",1391,38,9608,190,52862,1062
2,DC,Batman v Superman Dawn of Justice,"[There was a time above..., A time before..., ...","[today is a day for truth the world needs, to ...",1201,35,7445,210,40303,1059
3,DC,Birds of Prey,"[ They say, if you wanna tell a story right,, ...","[can I help you, y-yes yes you can I'm here to...",1879,13,10805,57,55220,292
4,DC,Constantine,"[MANUEL., MANUEL!, MANUEL., I THINK..., I THIN...","[what cold sure Bobby's know, mr. Constantine ...",937,18,5377,124,29169,627


Unnamed: 0,label,movie,corpus_movie,corpus_trailer,num_sentence_movie,num_sentence_trailer,num_words_movie,num_words_trailer,num_letters_movie,num_letters_trailer
25,Marvel,Thor The Dark World,"[Long before the birth of light, there was dar...","[after all this time now you come to, visit me...",1128,26,6332,142,33609,683
26,Marvel,Thor,"[Wait for it., - Can I turn on the radio? - No...","[without it I think you want to see this, deat...",1091,19,6558,132,34565,655
27,Marvel,X-Men Apocalypse,"[Mutants, born with extraordinary abilities., ...","[I saw the end of the world, i could feel all ...",1374,30,7956,208,41605,996
28,Marvel,X-Men Dark Phoenix,"[Who are we?, Are we simply what others want u...","[Why did you make me do that?, Look at me focu...",1194,18,6428,153,33341,754
29,Marvel,X-Men Days of Future Past,"[The future., A dark desolate world., World in...","[Oh, what's the last thing you remember, glimp...",1259,19,7485,106,39658,511


(49, 10)
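
To make the extraction step easier to follow, here is a minimal sketch that applies the same two regular expressions to a small in-memory .srt snippet. The snippet and the helper name `extract_srt_sentences` are made up for illustration and are not part of the corpus or the notebook code.

```python
import re

# Hypothetical two-cue .srt snippet, made up for illustration
sample_srt = (
    "1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "<i>Long before the birth of light,</i>\n"
    "there was darkness.\n"
    "\n"
    "2\n"
    "00:00:04,000 --> 00:00:06,000\n"
    "[THUNDER RUMBLING]\n"
    "\n"
)


def extract_srt_sentences(text):
    # Strip [..], (..), "SPEAKER: " prefixes and <..> tags, as in the preprocessing cell
    text = re.sub(r"\[.*?\]|\(.*?\)|.*?: |<.*?>", "", text)
    # Capture the text that follows each timestamp line, up to the blank separator line
    cues = re.findall(r".*? --> .*?\n(.*?)\n\n", text, re.S)
    # Join multi-line cues into a single line
    cues = [cue.replace("\n", " ") for cue in cues]
    # Drop cues that became empty after tag removal
    return [cue for cue in cues if cue]


sentences = extract_srt_sentences(sample_srt)
print(sentences)
```

The first cue spans two lines and is joined into one sentence, while the second cue contains only a sound-effect tag and is dropped once the tag is removed.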

## 1.2. Export Data
Once the data is cleaned, we store it in a local file for future use.

In [3]:
dataframe.to_csv("corpus_cleaned.csv", index=False)

## 2. Data Reader
It is also necessary to write a function that reads the data we just cleaned. In this section, we write a data reader for future use. Since we have not yet decided which model to use, we do not train a model in this submission.

In [4]:
def get_corpus(filepath):
    movie_df = pd.read_csv(filepath)
    display(movie_df.head())
    display(movie_df.tail(), movie_df.shape)
    return movie_df

dataframe = get_corpus("./corpus_cleaned.csv")
print("Successfully loaded the corpus data!")

Unnamed: 0,label,movie,corpus_movie,corpus_trailer,num_sentence_movie,num_sentence_trailer,num_words_movie,num_words_trailer,num_letters_movie,num_letters_trailer
0,DC,Aquaman,"['Jules Verne once wrote,', '""Put two ships in...","['my parents were from different worlds', 'and...",1133,51,7136,279,38086,1442
1,DC,Batman Begins,['- Can I see? - Finders keepers. And I found ...,"['Tell us, Mr. Wayne.', 'What do you fear?', '...",1391,38,9608,190,52862,1062
2,DC,Batman v Superman Dawn of Justice,"['There was a time above...', 'A time before.....","['today is a day for truth the world needs', '...",1201,35,7445,210,40303,1059
3,DC,Birds of Prey,"[' They say, if you wanna tell a story right,'...","['can I help you', ""y-yes yes you can I'm here...",1879,13,10805,57,55220,292
4,DC,Constantine,"['MANUEL.', 'MANUEL!', 'MANUEL.', 'I THINK...'...","[""what cold sure Bobby's know"", ""mr. Constanti...",937,18,5377,124,29169,627


Unnamed: 0,label,movie,corpus_movie,corpus_trailer,num_sentence_movie,num_sentence_trailer,num_words_movie,num_words_trailer,num_letters_movie,num_letters_trailer
44,Marvel,Thor The Dark World,"['Long before the birth of light, there was da...","['after all this time now you come to', 'visit...",1128,26,6332,142,33609,683
45,Marvel,Thor,"['Wait for it.', '- Can I turn on the radio? -...","['without it I think you want to see this', 'd...",1091,19,6558,132,34565,655
46,Marvel,X-Men Apocalypse,"['Mutants, born with extraordinary abilities.'...","['I saw the end of the world', 'i could feel a...",1374,30,7956,208,41605,996
47,Marvel,X-Men Dark Phoenix,"['Who are we?', 'Are we simply what others wan...","['Why did you make me do that?', ""Look at me f...",1194,18,6428,153,33341,754
48,Marvel,X-Men Days of Future Past,"['The future.', 'A dark desolate world.', 'Wor...","['Oh', ""what's the last thing you remember"", '...",1259,19,7485,106,39658,511


(49, 10)

Successfully loaded the corpus data!
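
Note that in the reloaded frame the `corpus_movie` and `corpus_trailer` cells are quoted strings rather than Python lists, because CSV stores each list as its text representation. The sketch below shows one possible way to restore the lists while reading, using `ast.literal_eval` as a `read_csv` converter. The one-row frame and the in-memory buffer are hypothetical stand-ins for `corpus_cleaned.csv`, and this is an optional variation rather than the reader used above.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame that mimics the corpus layout
original = pd.DataFrame({
    "label": ["DC"],
    "movie": ["Example"],
    "corpus_movie": [["Tell us, Mr. Wayne.", "What do you fear?"]],
    "corpus_trailer": [["What's the last thing you remember"]],
})

# Round-trip through CSV text in memory instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the stringified lists back into real lists while reading
list_columns = ["corpus_movie", "corpus_trailer"]
restored = pd.read_csv(
    buffer,
    converters={col: ast.literal_eval for col in list_columns},
)

print(type(restored["corpus_movie"].iloc[0]))
print(restored["corpus_movie"].iloc[0])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing text read from a file.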
