# INTRODUCTION

In this coursework we'll cover data exploration, preparation and analysis, correlation and feature selection, probabilistic data analysis, and clustering.

NOTE: This notebook requires only the Anaconda3 Python distribution to run: https://www.anaconda.com/distribution/#download-section

Alternatively:
    - Install "pipenv"
    - From the root directory of the project, run "pipenv shell" and when that finishes,
    - "pipenv install"
    - Then "jupyter-notebook classy/view/jupyter_notebook"
    - When the browser opens, click "Coursework One"

# DATA EXPLORATION

## Loading the data

We have been provided with CSV files containing instances of grayscale road sign images. I have already downloaded these and written a library (see the model folder) that reads CSV files into a pandas DataFrame and writes pandas DataFrames to HDF5 files.

Let's go ahead and load the images CSV.

In [1]:
# First we add the project into PYTHONPATH so our notebook can access the code in our MVC structure
import set_sys_path

In [2]:
# Now we import the data read-write library from the model package, and initialise the read and write objects
from classy.model.data.read import Reader
from classy.model.data.write import Writer

reader = Reader()
writer = Writer()

In [3]:
# Let's see what data we have been provided with
reader.list_data_files(file_type="csv")

['y_train_smpl_0.csv',
 'y_train_smpl_1.csv',
 'y_train_smpl_3.csv',
 'x_train_gr_smpl.csv',
 'y_train_smpl_2.csv',
 'y_train_smpl_6.csv',
 'y_train_smpl_7.csv',
 'y_train_smpl.csv',
 'y_train_smpl_5.csv',
 'y_train_smpl_4.csv',
 'y_train_smpl_9.csv',
 'y_train_smpl_8.csv']

The x_train_gr_smpl.csv file is the main dataset, containing the instances and their features (X in data mining lingo). The y_* files hold the labels for each instance (except y_train_smpl.csv).

The labels are listed in the same row order as the instances.
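
Since labels and instances are matched by row position, pairing them comes down to lining up row i of one file with row i of the other. A minimal sketch of that pairing, using tiny made-up frames in place of the real files (the names `features`, `labels`, `labelled` and the column name `label` are illustrative assumptions, not part of the provided data):

```python
import pandas as pd

# Tiny stand-ins for the real data: 3 instances with 4 pixel features each.
features = pd.DataFrame([[30, 29, 28, 29],
                         [31, 31, 33, 32],
                         [8, 7, 6, 6]])

# Stand-in label file: one label per row, in the same row order as `features`.
labels = pd.DataFrame({"label": [0, 0, 9]})

# Row counts must match before positional pairing makes sense.
assert len(features) == len(labels)

# Pair by position: reset both indexes so row i of one meets row i of the other.
labelled = pd.concat(
    [features.reset_index(drop=True), labels.reset_index(drop=True)],
    axis=1,
)
```

Checking the row counts first is a cheap guard: if a label file were shorter or longer than the feature file, positional pairing would silently mislabel instances.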

In [4]:
# Now let's load the main (X) dataset
x_train_gr_smpl = reader.load_data("x_train_gr_smpl", file_type="csv")

## Exploration

In [5]:
# What does the data look like?
x_train_gr_smpl

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2294,2295,2296,2297,2298,2299,2300,2301,2302,2303
0,30.0,29.0,28.0,29.0,31.0,30.0,29.0,28.0,27.0,26.0,...,31.0,32.0,35.0,38.0,39.0,39.0,40.0,39.0,39.0,38.0
1,31.0,31.0,33.0,32.0,31.0,30.0,29.0,28.0,28.0,28.0,...,32.0,34.0,35.0,36.0,36.0,37.0,38.0,38.0,37.0,37.0
2,30.0,30.0,31.0,29.0,28.0,27.0,26.0,28.0,30.0,31.0,...,33.0,35.0,37.0,37.0,38.0,39.0,38.0,38.0,39.0,40.0
3,26.0,25.0,24.0,24.0,24.0,27.0,28.0,29.0,29.0,30.0,...,32.0,34.0,36.0,37.0,38.0,42.0,40.0,37.0,36.0,36.0
4,25.0,26.0,28.0,28.0,28.0,28.0,28.0,27.0,26.0,25.0,...,30.0,31.0,33.0,37.0,38.0,37.0,36.0,36.0,35.0,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12655,8.0,8.0,7.0,7.0,8.0,8.0,8.0,8.0,8.0,8.0,...,11.0,10.0,9.0,9.0,10.0,9.0,9.0,10.0,11.0,11.0
12656,7.0,7.0,8.0,8.0,9.0,8.0,8.0,9.0,8.0,9.0,...,10.0,10.0,10.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0
12657,9.0,10.0,9.0,9.0,9.0,9.0,11.0,10.0,10.0,10.0,...,9.0,9.0,9.0,9.0,8.0,9.0,9.0,10.0,9.0,9.0
12658,8.0,7.0,6.0,6.0,6.0,6.0,7.0,6.0,6.0,6.0,...,10.0,9.0,9.0,9.0,9.0,9.0,10.0,10.0,9.0,9.0


In [6]:
# What datatypes are they?
x_train_gr_smpl.dtypes

0       float64
1       float64
2       float64
3       float64
4       float64
         ...   
2299    float64
2300    float64
2301    float64
2302    float64
2303    float64
Length: 2304, dtype: object

In [7]:
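# They all look like whole numbers. Are they really (floats with 0 after the decimal point)? Let's confirm.
# Note: the built-in all() would iterate over the DataFrame's column labels, not its values,
# so we reduce over the values with .all().all() instead.

import numpy as np

(x_train_gr_smpl % 1 == 0).all().all()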

True
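
One caveat when checking a whole DataFrame this way: iterating over a DataFrame yields its column labels, so Python's built-in `all()` never looks at the cell values. Reducing over the underlying array avoids that. A minimal sketch on made-up data (the helper name `is_all_whole` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up float data: the first frame holds only whole numbers, the second does not.
whole = pd.DataFrame({"0": [30.0, 31.0], "1": [29.0, 31.0]})
mixed = pd.DataFrame({"0": [30.0, 31.5], "1": [29.0, 31.0]})

def is_all_whole(df):
    """Return True if every value in `df` has no fractional part."""
    values = df.to_numpy(dtype=float)
    # A value is whole when it equals its own floor; NaN fails this comparison.
    return bool(np.all(values == np.floor(values)))

whole_ok = is_all_whole(whole)
mixed_ok = is_all_whole(mixed)
```

Running a check like this on a frame that is known to contain a fractional value is a quick way to confirm that the test can actually fail.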

In [8]:
# Confirmed. Let's convert them all to what they really are: integers.
x_train_gr_smpl = x_train_gr_smpl.astype(int)
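
Because these are pixel intensities, a narrower integer type would also fit. The sketch below, on made-up data, compares the memory used by a 64-bit integer with an unsigned 8-bit one. It is an optional alternative under the assumption that all values lie in 0-255, not what this notebook does:

```python
import numpy as np
import pandas as pd

# Made-up float pixel data standing in for the real frame.
pixels = pd.DataFrame(np.array([[30.0, 29.0, 255.0],
                                [8.0, 7.0, 5.0]]))

# Guard: uint8 can only hold 0..255, so check the range before casting.
assert pixels.min().min() >= 0 and pixels.max().max() <= 255

as_int64 = pixels.astype(np.int64)
as_uint8 = pixels.astype(np.uint8)

# Bytes used by the values themselves (index excluded).
bytes_int64 = as_int64.memory_usage(index=False).sum()
bytes_uint8 = as_uint8.memory_usage(index=False).sum()
```

The trade-off is that arithmetic on `uint8` wraps around on overflow, so sums or differences of pixels would need a cast back to a wider type first.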

In [9]:
x_train_gr_smpl

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2294,2295,2296,2297,2298,2299,2300,2301,2302,2303
0,30,29,28,29,31,30,29,28,27,26,...,31,32,35,38,39,39,40,39,39,38
1,31,31,33,32,31,30,29,28,28,28,...,32,34,35,36,36,37,38,38,37,37
2,30,30,31,29,28,27,26,28,30,31,...,33,35,37,37,38,39,38,38,39,40
3,26,25,24,24,24,27,28,29,29,30,...,32,34,36,37,38,42,40,37,36,36
4,25,26,28,28,28,28,28,27,26,25,...,30,31,33,37,38,37,36,36,35,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12655,8,8,7,7,8,8,8,8,8,8,...,11,10,9,9,10,9,9,10,11,11
12656,7,7,8,8,9,8,8,9,8,9,...,10,10,10,9,9,9,9,9,9,9
12657,9,10,9,9,9,9,11,10,10,10,...,9,9,9,9,8,9,9,10,9,9
12658,8,7,6,6,6,6,7,6,6,6,...,10,9,9,9,9,9,10,10,9,9


So this is <b>ratio</b> data, all in the same <b>range</b> (0-255).

There are also no null values: every cell holds a pixel brightness value, so no imputation is needed (for example, replacing a missing pixel with the mean of that pixel for its label).
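
One way to verify the absence of nulls explicitly is to count missing cells across the whole frame. A minimal sketch on made-up data (the frame names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up data: one frame is complete, the other has a single missing pixel.
complete = pd.DataFrame({"0": [30, 31], "1": [29, 31]})
with_gap = pd.DataFrame({"0": [30.0, np.nan], "1": [29.0, 31.0]})

# isna() marks each missing cell; summing twice counts them over the whole frame.
missing_in_complete = int(complete.isna().sum().sum())
missing_in_gap = int(with_gap.isna().sum().sum())
```

Applied to the real data, the same expression would be `x_train_gr_smpl.isna().sum().sum()`, with 0 meaning no missing cells.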

We have 12660 instances with 2304 features each (a 48×48 grayscale image unrolled into a single row).
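
To make the unrolling concrete, the sketch below folds a flat row of 2304 values back into a 48 by 48 grid. It uses a made-up row and assumes the pixels were stored row by row (row-major order), which the source data does not state:

```python
import numpy as np

SIDE = 48  # image width and height in pixels, from the text above

# Made-up flat row standing in for one instance: 2304 values, 0..2303.
flat_row = np.arange(SIDE * SIDE)

# Row-major reshape: the first 48 values become the top row of the image.
# Assumption: the CSV stores pixels row by row; the source does not state this.
image = flat_row.reshape(SIDE, SIDE)
```

With the real data, one instance could be folded the same way via `x_train_gr_smpl.iloc[0].to_numpy().reshape(48, 48)`. If the result looks transposed when plotted, the column-major assumption would be the one to try instead.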

In [10]:
# Now let's see some statistics about our data
x_train_gr_smpl.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2294,2295,2296,2297,2298,2299,2300,2301,2302,2303
count,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,...,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0,12660.0
mean,90.275908,90.29684,90.331043,90.36951,90.608847,90.85237,91.113033,91.427488,91.566588,91.775355,...,72.12346,71.547393,71.399368,70.980253,70.064455,69.243681,68.339336,67.716983,67.734992,67.784123
std,79.531811,79.674242,79.81624,79.896452,79.958141,80.082173,80.266651,80.406439,80.439131,80.457116,...,65.643296,65.317373,65.485237,65.649521,65.109495,64.328815,63.312147,62.688005,62.951496,63.289188
min,5.0,6.0,5.0,6.0,6.0,6.0,6.0,6.0,5.0,6.0,...,6.0,6.0,6.0,6.0,5.0,6.0,6.0,6.0,6.0,5.0
25%,29.0,29.0,29.0,28.0,28.0,28.0,28.0,28.0,28.0,29.0,...,26.0,25.0,25.0,25.0,25.0,24.0,24.0,24.0,24.0,24.0
50%,55.0,56.0,55.0,55.0,56.0,56.0,56.0,56.0,57.0,57.0,...,47.0,47.0,47.0,46.0,45.0,44.0,44.0,44.0,43.0,43.0
75%,136.0,136.0,135.0,136.0,137.0,137.0,140.0,140.0,140.0,140.0,...,94.0,93.0,93.0,92.0,91.0,90.0,88.0,87.0,86.0,87.0
max,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,...,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0


## Visualization

# DATA PREPARATION

## Correlation

## Feature Selection

# CLASSIFICATION

## Naïve Bayes

## Bayesian Network

# CLUSTERING

## k-means

# RESEARCH QUESTION