# Examining Data (Thyroid Dataset)

This notebook uses the thyroid disease dataset from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/thyroid+disease. The following files are used:

* `ann-train.data`: Source for training data
* `ann-test.data`: Source for the test data used to evaluate a model
* `ann-Readme`: Narrative for the dataset
* `ann-thyroid.names`: Meta information on the dataset

## Objectives

* Read the data as a raw file using the pandas library
* Examine missing values
* Examine unique values

## Narrative

*(Taken from Readme)* The problem is to determine whether a patient referred to the clinic is
hypothyroid. Therefore three classes are built: normal (not hypothyroid),
hyperfunction and subnormal functioning. Because 92 percent of the patients
are not hyperthyroid a good classifier must be significant better than 92%.
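
The 92% figure corresponds to the accuracy of a classifier that always predicts the majority class. The sketch below shows that calculation on made-up labels (not the real dataset), where 92 of 100 hypothetical patients share one class label; the label values themselves are arbitrary placeholders.

```python
import pandas as pd

# Hypothetical labels for illustration only: 92 majority-class patients, 8 others
labels = pd.Series([3] * 92 + [2] * 5 + [1] * 3)

# Share of each class; the largest share is the accuracy of always guessing that class
class_share = labels.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline: {baseline_accuracy:.0%}")
```

Any model trained on data with this class balance would need to beat this baseline to be useful.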

In [1]:
# Import the libraries
import pandas as pd

## Reading a CSV file using Pandas

In this section we load the data as is into the variables `df_train` and `df_test`. The `read_csv()` function from pandas reads a CSV file and converts it to a dataframe for later examination. Two things to consider when calling this function on the `ann-train.data` file:

1. The file is space-delimited, not comma-delimited. We therefore pass the extra parameter `delimiter=' '` to `read_csv` so that a single space is used as the separator when parsing the file.
2. There are no headers in the file *(the first row is the first data point of the dataset)*. By default, `read_csv` uses the first row as the column names. Passing `header=None` tells the function to treat every row as data and to label the columns with default integers (0, 1, 2, ...).

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
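
To make the effect of `header=None` concrete, here is a minimal sketch that parses a small in-memory sample instead of the real file. The two-row `sample` string is hypothetical and only mimics a space-delimited layout.

```python
import io
import pandas as pd

# Hypothetical two-row, space-delimited sample (not taken from the dataset)
sample = "0.73 0 1 3\n0.24 0 0 3\n"

# Default behaviour: the first row is consumed as column names
df_default = pd.read_csv(io.StringIO(sample), delimiter=" ")

# header=None: every row is data, columns are labelled 0, 1, 2, ...
df_no_header = pd.read_csv(io.StringIO(sample), header=None, delimiter=" ")

print(df_default.shape)            # (1, 4)
print(df_no_header.shape)          # (2, 4)
print(list(df_no_header.columns))  # [0, 1, 2, 3]
```

Without `header=None`, one data point is silently lost to the header row, which is easy to miss on a file with thousands of rows.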

In [7]:
# Paths to the data files (expected in the current directory)
csv_file_train = "./ann-train.data"
csv_file_test = "./ann-test.data"

# Store the training and test data
df_train = pd.read_csv(csv_file_train, header=None, delimiter=' ')
df_test = pd.read_csv(csv_file_test, header=None, delimiter=' ')

## Displaying Initial Data

We can examine the dataframe by calling its `head(n)` or `tail(n)` method. The `head` method displays the first `n` rows, whereas the `tail` method displays the last `n` rows.
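
As a quick illustration of both methods, the sketch below uses a hypothetical ten-row dataframe (`df_demo` is not part of the dataset) so the selected rows are easy to verify by index.

```python
import pandas as pd

# Hypothetical frame with ten rows, indexed 0..9
df_demo = pd.DataFrame({"value": range(10)})

first_rows = df_demo.head(3)   # rows with index 0, 1, 2
last_rows = df_demo.tail(3)    # rows with index 7, 8, 9
default_rows = df_demo.head()  # n defaults to 5 when omitted

print(first_rows)
print(last_rows)
```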

In [12]:
n = 5

df_train.head(n)

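| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.73 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0.0006 | 0.015 | 0.12 | 0.082 | 0.146 | 3 | NaN | NaN |
| 1 | 0.24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.00025 | 0.03 | 0.143 | 0.133 | 0.108 | 3 | NaN | NaN |
| 2 | 0.47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.0019 | 0.024 | 0.102 | 0.131 | 0.078 | 3 | NaN | NaN |
| 3 | 0.64 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.0009 | 0.017 | 0.077 | 0.09 | 0.085 | 3 | NaN | NaN |
| 4 | 0.23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.00025 | 0.026 | 0.139 | 0.09 | 0.153 | 3 | NaN | NaN |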
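
In the preview above, columns 22 and 23 hold no values. One plausible cause, which is an assumption and not confirmed by the dataset documentation, is that each line of the file ends with trailing spaces, so a single-space delimiter produces extra empty fields. The sketch below reproduces that effect on a hypothetical sample and shows one way to find and drop columns that are entirely empty.

```python
import io
import pandas as pd

# Hypothetical sample: each line ends with two trailing spaces
# (an assumption about the file layout, not verified against the real file)
sample = "0.73 0 1 3  \n0.24 0 0 3  \n"
df_raw = pd.read_csv(io.StringIO(sample), header=None, delimiter=" ")

# Labels of the columns in which every value is missing
empty_columns = df_raw.columns[df_raw.isna().all()].tolist()

# Drop only the columns that are entirely empty; partially filled columns are kept
df_clean = df_raw.dropna(axis=1, how="all")

print(empty_columns)
print(df_clean.shape)
```

The same two calls could be applied to `df_train` and `df_test`, but it is worth checking `empty_columns` first, since dropping columns changes the column count that later steps rely on.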
