### Data Cleaning and Processing: Cambridge English Readability Data Set and the One Stop English Corpus

This notebook showcases the cleaning process I undertook to prepare the *Cambridge English Readability Data Set* for data analysis. 

The data set can be found here: https://ilexir.co.uk/datasets/index.html

I would like to acknowledge the authors below, as required by the licence agreement, and note that the data set is used solely for learning purposes.

Citation:

*Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben, ‘A New Dataset and Method for Automatically Grading ESOL Texts’<br> Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.*

#### Note: 
Cleaning the *Cambridge English Readability Data Set* was an iterative process of manually examining the texts by looping over them in Python. Because the data set was small enough to manage, I took notes on specific words and phrases that caused problems, such as list header items, and examined the lengths of title lines to choose a cut-off that removes titles without deleting the short first lines of texts that have no title. I have left some of the test code in commented-out blocks to give a sense of the cleaning process. I believe the documents are reasonably clean, though some list items and other short words that should have been deleted were missed, and a more thorough cleaning should still be conducted.
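
The title-line cut-off described above can be sketched roughly as follows. This is only an illustration, not the code in *cleaning_nlp.py*: the function name `split_title_line` is hypothetical, the default of 45 characters is borrowed from the commented-out inspection cell near the end of this notebook, and the rule that a title does not end in sentence punctuation is my own assumption about how short untitled first lines could be protected.

```python
def split_title_line(text, max_title_len=45):
    """Return (title, body); title is None when the first line looks like body text.

    Assumptions (not taken from cleaning_nlp.py): a title is a short first
    line that is followed by more text and does not end in '.', '!' or '?'.
    """
    first, sep, rest = text.partition("\n")
    candidate = first.strip()
    looks_like_title = (
        bool(sep)                                   # there is text after the first line
        and 0 < len(candidate) <= max_title_len     # short enough to be a title
        and not candidate.endswith((".", "!", "?")) # not an ordinary sentence
    )
    if looks_like_title:
        return candidate, rest.lstrip("\n")
    return None, text
```

A call such as `split_title_line("CANADA GEESE\n\nCanada geese are large birds.")` would separate the title from the body, while a text that opens with a short full sentence is returned unchanged.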

Deleted files **FCE/19.txt, FCE/28.txt, FCE/32.txt** as they were duplicates of **FCE/20.txt, FCE/29.txt, FCE/33.txt**. 
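
The notebook does not record how these duplicates were spotted. One possible check, sketched under that caveat, is to hash each text after collapsing whitespace and group the names that share a hash. The function name `find_duplicate_texts` and the name-to-text mapping it takes are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicate_texts(texts):
    """Group names whose whitespace-normalized text is identical.

    `texts` is a hypothetical mapping of file name -> file contents.
    """
    groups = defaultdict(list)
    for name, text in texts.items():
        # Collapse all runs of whitespace so layout differences do not matter.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Keep only hashes shared by more than one file.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

The mapping could be built by reading every `.txt` file in a level folder; only exact matches after whitespace normalization are reported, so near-duplicates with edited wording would still need manual inspection.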

Modified **PET/34.txt** due to lack of spacing in two sentences.
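
Missing spaces between sentences, like the ones fixed by hand here, could also be flagged automatically. The sketch below is an assumption on my part and not part of *cleaning_nlp.py*: the pattern and the name `find_missing_spaces` are hypothetical, and the pattern only catches a lowercase letter, sentence punctuation and an uppercase letter with nothing in between.

```python
import re

# Lowercase letter, sentence-ending punctuation, then an uppercase letter
# with no space in between, e.g. "home.The".
MISSING_SPACE = re.compile(r"[a-z][.!?][A-Z]")


def find_missing_spaces(text, context=15):
    """Return snippets of text around each suspected missing space."""
    hits = []
    for match in MISSING_SPACE.finditer(text):
        start = max(match.start() - context, 0)
        hits.append(text[start:match.end() + context])
    return hits
```

The snippets are meant for manual review rather than automatic repair, since some matches may be legitimate.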

All cleaning functions can be found in the *cleaning_nlp.py* file.

Otherwise, the original files are untouched; the texts are cleaned only after being loaded and processed by the process_directory function.

In [1]:
import pandas as pd
import cleaning_nlp as cl

In [2]:
documents, levels, first_line_lens = cl.process_directory()
# documents, doc_list, levels, first_line_lens = cl.process_directory(cefr=False) # -> Created second version just as back-up 

Currently processing: KET

Removed First line: Otters 

Removed First line: BICYCLES 

Removed First line: Bill Prince-Smith

Removed First line: ESTHER'S STORY

Removed First line: A HISTORY OF AIR TRAVEL 

Removed First line: CANADA GEESE

Removed First line: Memo 

Removed: Memo

Removed First line: BURGLARS LOVE THE AFTERNOON

Removed First line: CROCODILES 

Removed First line: Madame Tussaud's

Removed First line: The Weather 

Removed First line: The Elephant Show 

Removed: by Daniel Allsop, age 14 

... ... ...
Removed First line: The Heat is On

Removed First line: Music - The Challenge Ahead

Removed First line: Metals

Removed First line: Work

Removed First line: The Lure of the Kitchen

Removed: SAILING

Removed First line: BROADCASTING: The Social Shaping of a Technology

Removed First line: 0ral History

CPE has 69 files

Number of First Line Deletions: 59



In [3]:
# Final Manual Inspection after Cleaning
# for row in cl.cefr_to_data_frame(documents, levels)['documents']:
#     print(row)
#     print('\n')

In [4]:
dataframe = cl.cefr_to_data_frame(documents, levels)

In [5]:
# dataframe

In [6]:
# Write the cleaned data to csv file

# dataframe.to_csv('data/cefr_readings.csv', index=False)
# dataframe.to_csv('data/cefr_readings_numeric.csv', index=False)

#### Note 

The data set below was processed using my own functions and stored as lists in tuples, along with various statistics on the data set. I originally parsed the directory */Texts-Together-OneCSVperFile* (found at the GitHub link below) and loaded the CSV files into one data frame, but I noticed spacing issues in many of the sentences. I changed course and processed the text files instead.

Because of the size of the data set, I could not manually look through all of it. However, after looking through various random ranges of samples, I concluded that the spacing issue was not present and that the data looked reasonably clean.
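
A reproducible way to draw such spot-check samples might look like the sketch below. It is not the procedure actually used here (the commented-out cell further down steps through a fixed index range instead); the function name `sample_documents` and its parameters are hypothetical.

```python
import random


def sample_documents(documents, k=20, seed=0):
    """Return up to k (index, document) pairs chosen at random, in index order."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    k = min(k, len(documents))
    indices = sorted(rng.sample(range(len(documents)), k))
    return [(i, documents[i]) for i in indices]
```

With the data frame from this notebook, the input could be `one_stop_df.documents.tolist()`, and changing `seed` would give a different batch to read through.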

The One Stop English Corpus can be found here: https://github.com/nishkalavallabhi/OneStopEnglishCorpus

I would like to acknowledge the authors below, as required by the licence agreement, and note that the data set is used solely for learning purposes.

Citation:

*Vajjala, Sowmya and Lučić, Ivana (2018), ‘OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification’<br>
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 297–304. Association for Computational Linguistics.*

In [7]:
one_stop_df = cl.get_one_stop_dataframe()

In [8]:
# one_stop_df

In [9]:
one_stop_df.to_csv('data/one_stop.csv', index=False)

In [10]:
# for i in range(6800, 7300, 2):
#     print(one_stop_df.documents[i])

### Data Cleaning Inspection 
Below are some samples of earlier inspection code that informed how I cleaned the data and wrote the process_directory function.

In [11]:
# Inspection of first lines used to determine the cut off point for first line lengths
# for i, e in enumerate(first_line_lens):
#     if (e > 45):# and (e > 54):
#         print(i, e)
#         print(documents[i][:e])

In [12]:
# Inspection code before final inspection
# c = 0
# for i in range(len(documents)):
#     print('********Document {n} ***********'.format(n=i))
#     print(documents[i])
#     if c == 190:
#         break
#     c += 1

# for i in range(200, 300, 1):
#     print('********Document {n} ***********'.format(n=i))
#     print(documents[i])