## Reading age dataset
### Exploratory data analysis
- Load dataset and review
- Encoding:
    - The data comes from OCR, so some characters were not read successfully; these may show up as errors in the text
- Data review
    - The data has 3 columns and 7,584 rows
    - Age - 39 unique values
        - In the original dataset there are 15 observations where the data has not split correctly and the age column includes some text. As a simple solution these rows have been removed to form `df_clean`
        - Data checks were continued on `df_clean`
    - After cleaning
        - There were no missing values in any column
        - Currently all columns have dtype `object`:
        -   Excerpt - to be set to string **to be updated**
        -   Book_and_page - to be set to string **to be updated**
        -   Age - to be set to categorical (tbc) **to be updated**
        -   Duplicated data (counted on the original `df`)
            -   Excerpt - 7,584 observations, with 5,934 unique observations (1,650 duplicates)
            -   Book_and_page - 7,583 observations, with 4,559 unique observations (3,024 duplicates)
            -   Excerpt & Book_and_page combined - 1,603 duplicates
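
The planned type changes listed above could be sketched roughly as follows. This is only an illustration: the toy `df_clean` and the name `df_typed` are made up here, and making `age` an *ordered* categorical is an assumption, since the notes still mark that choice as tbc.

```python
import pandas as pd

# Toy stand-in for df_clean; these rows are illustrative, not real dataset rows
df_clean = pd.DataFrame(
    {
        "excerpt": ["Up pops the big carrot!", "Chip had a bug."],
        "book_and_page": ["Book A pg1", "Book B pg8"],
        "age": [1, 5],
    },
    dtype="object",
)

# Text columns: object -> pandas string dtype
df_typed = df_clean.astype({"excerpt": "string", "book_and_page": "string"})

# Age: ordered categorical (assumption), so comparisons such as age <= 7 stay meaningful
age_levels = sorted(df_clean["age"].unique())
df_typed["age"] = pd.Categorical(df_clean["age"], categories=age_levels, ordered=True)

print(df_typed.dtypes)
```

If the ordering of ages turns out not to matter for the model, `ordered=False` (or a plain integer dtype) would work just as well.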
    

In [70]:
# Import modules
import pandas as pd
import numpy as np
import nltk

In [36]:
# File path
file_path = 'ocr_data.xlsx'

# Read excel file, add column names
df = pd.read_excel(file_path, names=['excerpt','book_and_page','age'], usecols=[0, 1, 2], skiprows=[0])

# Print the first five rows
print(df.head())


                             excerpt book_and_page age
0   Tom and Ifra tug the big carrot.                 1
1       Tom, Ifra, Nick and Lin tug.                 1
2  Tom, Ifra, Nick, Lin and Sam tug.                 1
3        Nick, Lin, Sam and Kit tug.                 1
4            Up pops the big carrot!                 1


In [37]:
# Explore df
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7584 entries, 0 to 7583
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   excerpt        7584 non-null   object
 1   book_and_page  7583 non-null   object
 2   age            7583 non-null   object
dtypes: object(3)
memory usage: 177.9+ KB
None
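
The non-null counts above (7,583 of 7,584 for `book_and_page` and `age`) suggest one missing value in each of those columns. A minimal sketch of counting and locating missing values directly is shown below; the toy `df` and the names `missing_per_column` and `rows_with_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are illustrative only
df = pd.DataFrame(
    {
        "excerpt": ["a", "b", "c"],
        "book_and_page": ["p1", np.nan, "p3"],
        "age": [1, 2, np.nan],
    },
    dtype="object",
)

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that have at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```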


In [38]:
# Print shape
print(df.shape)

(7584, 3)


In [47]:
# Description
print(df.describe())

                                                  excerpt     book_and_page  \
count                                                7584              7583   
unique                                               5934              4559   
top     We are travelling through space on board the m...   Double Exposure   
freq                                                   11               181   

         age  
count   7583  
unique    39  
top       11  
freq     957  


In [120]:
# Check age variable
#print(df.age.unique())

def filter_list(col):
    """Return the index labels of the entries in `col` that are integers."""
    f_list = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, x in col.items():
        if isinstance(x, int):
            f_list.append(i)
    return f_list

# Run filter_list on the age column; a separate name avoids overwriting the function
int_age_index = filter_list(df['age'])

# Filter for observations where age is an integer
df_clean = df.loc[int_age_index]
print(df_clean.head())
print(df_clean.shape)


                             excerpt book_and_page age
0   Tom and Ifra tug the big carrot.                 1
1       Tom, Ifra, Nick and Lin tug.                 1
2  Tom, Ifra, Nick, Lin and Sam tug.                 1
3        Nick, Lin, Sam and Kit tug.                 1
4            Up pops the big carrot!                 1
(7569, 3)
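
As an alternative to looping over the column, the same kind of filter could be sketched with `pd.to_numeric`. This is a hedged variant, not the method used above: unlike the `isinstance` check it would also keep ages stored as numeric strings such as `"11"`, so the row count may differ on the real data. The toy `df` and the names `age_num` and `is_int_age` are illustrative.

```python
import pandas as pd

# Toy stand-in for df; the values are illustrative only
df = pd.DataFrame(
    {
        "excerpt": ["a", "b", "c", "d"],
        "book_and_page": ["p1", "p2", "p3", "p4"],
        "age": [1, "some text 7", 11, None],
    },
    dtype="object",
)

# Anything that cannot be parsed as a number becomes NaN
age_num = pd.to_numeric(df["age"], errors="coerce")

# Keep rows whose age parsed successfully and is a whole number
is_int_age = age_num.notna() & (age_num % 1 == 0)
df_clean = df.loc[is_int_age].copy()

print(df_clean.shape)
```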


In [123]:
# Check df_clean
df_clean.info()

# Describe df_clean
df_clean.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7569 entries, 0 to 7583
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   excerpt        7569 non-null   object
 1   book_and_page  7569 non-null   object
 2   age            7569 non-null   object
dtypes: object(3)
memory usage: 236.5+ KB


                                                  excerpt    book_and_page   age
count                                                7569             7569  7569
unique                                               5920             4547    27
top     We are travelling through space on board the m...  Double Exposure    11
freq                                                   11              181   957


In [151]:
# Check for duplicated observations in excerpt column
excerpt_dup = df.excerpt.duplicated()
# Summing a boolean Series counts the True values
no_dup_rows_excerpt = excerpt_dup.sum()
print("There are {} duplicated rows in the excerpt column".format(no_dup_rows_excerpt))

# Check for duplicated observations in the excerpt and book and page columns concatenated
df['excerpt_and_book'] = df['excerpt'] + df['book_and_page']
excerpt_and_book_dup = df.excerpt_and_book.duplicated()
no_dup_rows_excerpt_and_book = excerpt_and_book_dup.sum()
print("There are {} duplicated rows in the concatenated excerpt and book column".format(no_dup_rows_excerpt_and_book))

There are 1650 duplicated rows in the excerpt column
Chip had a bug.Big bad bug pg8/572_right.png
There are 1603 duplicated rows in the concatenated excerpt and book column
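
The combined check can also be sketched without building a helper column, using `DataFrame.duplicated` with `subset`. This is an illustrative variant on toy data. On the real dataset the count could differ slightly from the concatenation approach, for example where two different excerpt/book pairs happen to concatenate to the same string. The names `n_dup_excerpt` and `n_dup_pair` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df; the values are illustrative only
df = pd.DataFrame(
    {
        "excerpt": ["x", "x", "x", "y"],
        "book_and_page": ["p1", "p1", "p2", "p2"],
    }
)

# Rows whose excerpt repeats an earlier excerpt
n_dup_excerpt = df["excerpt"].duplicated().sum()

# Rows where the (excerpt, book_and_page) pair repeats an earlier pair
n_dup_pair = df.duplicated(subset=["excerpt", "book_and_page"]).sum()

print("Duplicated excerpts: {}".format(n_dup_excerpt))
print("Duplicated excerpt/book pairs: {}".format(n_dup_pair))
```

Passing `keep=False` to `duplicated` would instead flag every member of a duplicated group, which helps when inspecting the rows rather than counting them.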
