# Load Data  
The files 'train_data_complete.xlsx' and 'test_data_complete.xlsx' together contain all data from the original **IMDb Movie Review 50k Dataset**, available at https://ai.stanford.edu/~amaas/data/sentiment/. The information is condensed into two plain tables that can be loaded and used directly. Below are some basic examples of how to load and split the data for different purposes. 

In [1]:
import pandas as pd

First, load the tables into data frames: 

In [2]:
# dtype=str loads every column, including Rating, as text
df_train_complete = pd.read_excel('train_data_complete.xlsx', dtype=str)
df_test_complete = pd.read_excel('test_data_complete.xlsx', dtype=str)

Let's have a look at the data frames. Each frame includes six columns.  
*ID* holds the unmodified IDs from the original data set [1]. They are unique within each subset and range from 0 to 12499. (The subsets are training positive, training negative, test positive and test negative.)
While the *Review* column contains the original, unprocessed review texts, *Review clean* provides a preprocessed version: all letters were converted to lowercase, and punctuation and irregular symbols were removed.  
*Rating* is the IMDb rating from 1 to 10 associated with the review. Note that the dataset does not contain reviews with a rating of 5 or 6. Reviews are grouped into positive (7-10) and negative (1-4) sentiments, as indicated by *Sentiment*. The column *Set* specifies whether the review is part of the training or test set.
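
The relationship between *Rating* and *Sentiment* described above can be written down as a small check. This is only a minimal sketch on a hand-made toy frame: the rows and the helper name `sentiment_from_rating` are illustrative assumptions and not part of the dataset files.

```python
import pandas as pd

# Toy rows mimicking the column layout; the values are made up for illustration.
toy = pd.DataFrame({
    'ID': ['0', '1', '0', '1'],
    'Rating': ['9', '7', '1', '4'],
    'Sentiment': ['pos', 'pos', 'neg', 'neg'],
    'Set': ['train', 'train', 'train', 'train'],
})

def sentiment_from_rating(rating):
    """Hypothetical helper: map ratings 7-10 to 'pos' and 1-4 to 'neg'."""
    rating = int(rating)  # ratings are strings when loaded with dtype=str
    if 7 <= rating <= 10:
        return 'pos'
    if 1 <= rating <= 4:
        return 'neg'
    raise ValueError(f'unexpected rating: {rating}')

# Derive the sentiment from the rating and compare it with the stored label.
derived = toy['Rating'].map(sentiment_from_rating)
labels_consistent = bool((derived == toy['Sentiment']).all())

# Check that no review carries a rating of 5 or 6.
has_neutral = bool(toy['Rating'].isin(['5', '6']).any())

print(labels_consistent, has_neutral)
```

Running the same two checks on `df_train_complete` or `df_test_complete` should confirm the description, provided the files match it.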

In [3]:
df_train_complete

Unnamed: 0,ID,Review,Review clean,Rating,Sentiment,Set
0,0,Bromwell High is a cartoon comedy. It ran at t...,bromwell high is a cartoon comedy it ran at th...,9,pos,train
1,1,"If you like adult comedy cartoons, like South ...",if you like adult comedy cartoons like south p...,7,pos,train
2,2,Bromwell High is nothing short of brilliant. E...,bromwell high is nothing short of brilliant ex...,9,pos,train
3,3,"""All the world's a stage and its people actors...",all the world s a stage and its people actors ...,10,pos,train
4,4,FUTZ is the only show preserved from the exper...,futz is the only show preserved from the exper...,8,pos,train
...,...,...,...,...,...,...
24995,12495,"OK, I love bad horror. I especially love horro...",ok i love bad horror i especially love horror ...,1,neg,train
24996,12496,To be brutally honest... I LOVED watching Seve...,to be brutally honest i loved watching severed...,1,neg,train
24997,12497,I'm sure that the folks on the Texas/Louisiana...,i m sure that the folks on the texas louisiana...,4,neg,train
24998,12498,This film has the kernel of a really good stor...,this film has the kernel of a really good stor...,2,neg,train


In [4]:
df_test_complete

Unnamed: 0,ID,Review,Review clean,Rating,Sentiment,Set
0,0,I went and saw this movie last night after bei...,i went and saw this movie last night after bei...,10,pos,test
1,1,My boyfriend and I went to watch The Guardian....,my boyfriend and i went to watch the guardiana...,10,pos,test
2,2,My yardstick for measuring a movie's watch-abi...,my yardstick for measuring a movie s watch abi...,7,pos,test
3,3,How many movies are there that you can think o...,how many movies are there that you can think o...,7,pos,test
4,4,This movie was sadly under-promoted but proved...,this movie was sadly under promoted but proved...,10,pos,test
...,...,...,...,...,...,...
24995,12495,CyberTracker is set in Los Angeles sometime in...,cybertracker is set in los angeles sometime in...,3,neg,test
24996,12496,Eric Phillips (Don Wilson) is a secret service...,eric phillips don wilson is a secret service a...,3,neg,test
24997,12497,Plot Synopsis: Los Angeles in the future. Crim...,plot synopsis los angeles in the future crime ...,4,neg,test
24998,12498,"Oh, dear! This has to be one of the worst film...",oh dear this has to be one of the worst films ...,1,neg,test


---
If you need all data in a single data frame, you can simply concatenate the two frames. 

In [5]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_complete = pd.concat([df_train_complete, df_test_complete], ignore_index=True)
df_complete

Unnamed: 0,ID,Review,Review clean,Rating,Sentiment,Set
0,0,Bromwell High is a cartoon comedy. It ran at t...,bromwell high is a cartoon comedy it ran at th...,9,pos,train
1,1,"If you like adult comedy cartoons, like South ...",if you like adult comedy cartoons like south p...,7,pos,train
2,2,Bromwell High is nothing short of brilliant. E...,bromwell high is nothing short of brilliant ex...,9,pos,train
3,3,"""All the world's a stage and its people actors...",all the world s a stage and its people actors ...,10,pos,train
4,4,FUTZ is the only show preserved from the exper...,futz is the only show preserved from the exper...,8,pos,train
...,...,...,...,...,...,...
49995,12495,CyberTracker is set in Los Angeles sometime in...,cybertracker is set in los angeles sometime in...,3,neg,test
49996,12496,Eric Phillips (Don Wilson) is a secret service...,eric phillips don wilson is a secret service a...,3,neg,test
49997,12497,Plot Synopsis: Los Angeles in the future. Crim...,plot synopsis los angeles in the future crime ...,4,neg,test
49998,12498,"Oh, dear! This has to be one of the worst film...",oh dear this has to be one of the worst films ...,1,neg,test
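
Because the IDs are only unique within each subset, they repeat once the frames are combined. The sketch below uses toy stand-ins for the two frames (all values are made up) to show that, going by the column description, the combination of *Set*, *Sentiment* and *ID* identifies a row, and that a boolean mask on *Set* recovers the original subsets.

```python
import pandas as pd

# Toy stand-ins for the two loaded frames; values are made up for illustration.
toy_train = pd.DataFrame({
    'ID': ['0', '0'],
    'Sentiment': ['pos', 'neg'],
    'Set': ['train', 'train'],
})
toy_test = pd.DataFrame({
    'ID': ['0', '0'],
    'Sentiment': ['pos', 'neg'],
    'Set': ['test', 'test'],
})
toy_complete = pd.concat([toy_train, toy_test], ignore_index=True)

# ID alone repeats across subsets, so it is not a unique key after combining.
id_is_unique = bool(toy_complete['ID'].is_unique)

# The combination of Set, Sentiment and ID has no duplicates.
key_is_unique = not toy_complete.duplicated(subset=['Set', 'Sentiment', 'ID']).any()

# Boolean masks on Set recover the original subsets.
train_again = toy_complete[toy_complete['Set'] == 'train'].reset_index(drop=True)
test_again = toy_complete[toy_complete['Set'] == 'test'].reset_index(drop=True)

print(id_is_unique, key_is_unique, len(train_again), len(test_again))
```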


In the combined frame above, the *Set* column is needed to tell the two sets apart. If you keep the sets separate, you can just as well drop it. 

In [6]:
# drop() returns a new data frame; assign the result if you want to keep it
df_train_complete.drop('Set', axis=1)
df_test_complete.drop('Set', axis=1)

Unnamed: 0,ID,Review,Review clean,Rating,Sentiment
0,0,I went and saw this movie last night after bei...,i went and saw this movie last night after bei...,10,pos
1,1,My boyfriend and I went to watch The Guardian....,my boyfriend and i went to watch the guardiana...,10,pos
2,2,My yardstick for measuring a movie's watch-abi...,my yardstick for measuring a movie s watch abi...,7,pos
3,3,How many movies are there that you can think o...,how many movies are there that you can think o...,7,pos
4,4,This movie was sadly under-promoted but proved...,this movie was sadly under promoted but proved...,10,pos
...,...,...,...,...,...
24995,12495,CyberTracker is set in Los Angeles sometime in...,cybertracker is set in los angeles sometime in...,3,neg
24996,12496,Eric Phillips (Don Wilson) is a secret service...,eric phillips don wilson is a secret service a...,3,neg
24997,12497,Plot Synopsis: Los Angeles in the future. Crim...,plot synopsis los angeles in the future crime ...,4,neg
24998,12498,"Oh, dear! This has to be one of the worst film...",oh dear this has to be one of the worst films ...,1,neg


--- 
If you want to use the tables directly for model training (as in https://kgptalkie.com/sentiment-classification-using-bert/), you might want a table that contains only the review text and the sentiment: 

In [7]:
df_train = df_train_complete[['Review','Sentiment']]
df_test = df_test_complete[['Review','Sentiment']]

You can of course also use the actual ratings or the cleaned review texts instead. 

In [8]:
df_train = df_train_complete[['Review clean','Rating']]
df_test = df_test_complete[['Review clean','Rating']]
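
Since the tables are loaded with `dtype=str`, the ratings and sentiments arrive as text. Many training pipelines expect numeric targets, so a conversion step may be needed. The following is a minimal sketch on a toy frame; the example rows, the `Label` column name and the 1/0 encoding are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for a frame loaded with dtype=str; values are made up.
toy = pd.DataFrame({
    'Review clean': ['great film', 'dull plot'],
    'Rating': ['9', '2'],
    'Sentiment': ['pos', 'neg'],
})

# Convert the rating strings to integers, e.g. for regression on the rating.
toy['Rating'] = toy['Rating'].astype(int)

# Encode the sentiment as 1 (pos) / 0 (neg); this mapping is an assumption.
toy['Label'] = toy['Sentiment'].map({'pos': 1, 'neg': 0})

print(toy.dtypes)
```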

--- 
Tables can be saved as follows:

In [9]:
df_train.to_excel('train_1.xlsx', engine='xlsxwriter', index=False)
df_test.to_excel('test_1.xlsx', engine='xlsxwriter', index=False)
df_complete.to_excel('IMDb-Movie-Reviews-50k_complete.xlsx', engine='xlsxwriter', index=False)
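
The `index=False` argument keeps the row index out of the saved file. The sketch below illustrates the effect with an in-memory CSV round trip, so that it runs without an Excel engine installed; the toy frame and the helper name `round_trip_columns` are assumptions, and `to_excel` with `read_excel` is expected to behave the same way with respect to the index column.

```python
import io
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({'Review': ['good', 'bad'], 'Sentiment': ['pos', 'neg']})

def round_trip_columns(frame, index):
    """Write the frame to an in-memory CSV and return the columns read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return list(pd.read_csv(buffer, dtype=str).columns)

# With the index written, an extra unnamed column appears on reload.
with_index = round_trip_columns(toy, index=True)

# Without the index, only the original columns come back.
without_index = round_trip_columns(toy, index=False)

print(with_index)
print(without_index)
```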

---
### References
[1] https://ai.stanford.edu/~amaas/data/sentiment/   
[2] Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142-150).