<a href="https://colab.research.google.com/github/quanticedu/IntroToML/blob/main/ML102/IntroToPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Welcome to the CSV Files
In this lesson we'll learn the basics of data preprocessing for machine learning using Python and the Pandas library.

## Introducing Pandas

Pandas, short for 'panel data', makes data manipulation easy. With Pandas, a dataset can be converted into a Python object known as a **DataFrame**, which comes with a variety of handy methods.

In [None]:
import pandas as pd

# in vanilla Python, a DataFrame's structure is roughly analogous to the following:
dataframe_struct = {"Celestial Object": ["Mars", "Saturn", "Pluto (RIP)", "Mercury", "Europa", "Titan"],
                    "Miles away (in M)": [208, 1008, 3292, 64, 1100, 1212],
                    "Inhabited by aliens?": [True, False, False, True, True, True]}
# convert dataframe_struct into a DataFrame
df = pd.DataFrame(dataframe_struct)

# print the DataFrame's first five rows
df.head()

## Mount Google Drive for Data Import

The code block below will connect this Colab notebook to your Google Drive account. From there, we'll be able to load the dataset you've downloaded to your Google Drive into this notebook.


Run the code block and it'll ask you to authenticate via Google. Click yes through everything.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Reading Data From External Files

With the basics out of the way, let's pull in the UFO data.

Assuming your dataset is downloaded to the root of your Google Drive, the path passed to `pd.read_csv()` below should work just fine.

In [None]:
ufo_data = pd.read_csv('/content/drive/My Drive/ufo_data.csv')
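
To see what `pd.read_csv()` returns without touching Google Drive, here is a minimal sketch that reads CSV text from an in-memory buffer. The two rows are made up for illustration, and the column names are a subset of the ones this lesson uses later; their order in the real file is an assumption.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real ufo_data.csv has more rows and columns.
csv_text = """city,state,shape,duration,summary
Roswell,NM,disk,5 minutes,Bright disk hovering over the highway
Phoenix,AZ,light,2 hours,Row of lights moving slowly
"""

# read_csv accepts a path or any file-like object, so StringIO stands in for the file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred by Pandas
```

The header row becomes the column labels, and each remaining line becomes one row of the DataFrame.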

## An Optional Note On Slice Notation
If you've never seen colons used to slice ranges of data in Python before, some of the syntax that follows can be a little confusing.

In brief, the slice operator `:` extracts a subset of a list, where the left-hand term is inclusive and the right-hand term is exclusive. So, if we wanted to slice a `ufo_list` into a smaller subset, we'd use `ufo_list[this_is_included:and_this_is_not]`. We can also count backward from the end of the list by using negative indices.

Learn more about ranges and slicing in the [Python docs](https://docs.python.org/3/tutorial/introduction.html#text) (you'll need to scroll down a little).



In [None]:
# for instance, consider this list (named `letters` so it doesn't shadow Python's built-in `list`)
letters = ['a','b','c','d','e','f']
print(letters[1:3]) # remember the list is zero indexed!
print(letters[:4]) # the value after the colon indicates where the new sequence should end...
print(letters[2:]) # ...while the value before it indicates where it should start.
print(letters[:]) # if both terms are missing, *all* items are returned
print(letters[-4:-1]) # from the Python docs: "Note that since -0 is the same as 0, negative indices start from -1." The stop is still exclusive, so 'f' is left out.
print(letters[::-1]) # the optional third term is the 'step', so a step of -1 returns the entire list in reverse

## Locating and Selecting Data with `loc` and `iloc`


### `loc` and labels

`loc` selects data by label. By default, rows are labeled only with integers (0, 1, 2, and so on) rather than meaningful names, so our first examples select columns only.

In [None]:
summaries = ufo_data.loc[:, "summary"]
summaries.head()

In [None]:
label_list = ufo_data.loc[:, ["summary","shape","duration"]]
label_list.head()

In [None]:
label_range = ufo_data.loc[:, "city":"shape"]
label_range.head()
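
One difference from plain Python slicing is worth seeing in isolation: a label slice with `loc` includes both endpoints, while a positional slice with `iloc` (covered in the `iloc` section of this lesson) excludes the stop position. The sketch below uses a small made-up DataFrame; `toy` and its values are hypothetical, not the UFO data.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "city": ["Roswell", "Phoenix"],
    "state": ["NM", "AZ"],
    "shape": ["disk", "light"],
    "duration": ["5 minutes", "2 hours"],
})

by_label = toy.loc[:, "city":"shape"]  # label slice: "shape" is included
by_position = toy.iloc[:, 0:2]         # positional slice: position 2 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```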

#### An aside on index columns
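
Setting an index column gives rows labels that `loc` can look up. Here is a minimal sketch on a made-up DataFrame; the `pets` data and its column names are hypothetical and unrelated to the UFO dataset.

```python
import pandas as pd

# Hypothetical data for illustration only
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "name": ["Tom", "Rex", "Luna", "Kiwi"],
    "age": [3, 5, 2, 1],
})

# use the "species" column as the row labels; drop=True removes it from the columns
by_species = pets.set_index("species", drop=True)

# loc with a row label returns every row carrying that label
cats = by_species.loc["cat"]
print(cats)

# a second argument narrows the result to a single column
cat_names = by_species.loc["cat", "name"]
print(list(cat_names))
```

Because index labels don't have to be unique, one label can match several rows, as `"cat"` does here.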

In [None]:
# after setting the index column, `loc` will access each row via that column's value
col_indexed_data = ufo_data.set_index("shape", inplace=False, drop=True)
# now we can select rows using their labels with loc.
#fill in the code below to select all rows with the label 'fireball'
row_select = col_indexed_data.loc[]
print(row_select)

# now try printing summaries of all rows labeled "sphere"


### `iloc` and indexing
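
`iloc` selects by integer position rather than by label, following the same zero-based, stop-exclusive rules as the list slices shown earlier. Here is a minimal sketch on a made-up DataFrame; `grid` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
grid = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
    "c": [100, 200, 300, 400],
})

single_value = grid.iloc[2, 1]   # third row, second column
last_column = grid.iloc[:, -1:]  # all rows, last column only (still a DataFrame)
middle_rows = grid.iloc[1:3, :]  # rows at positions 1 and 2, all columns

print(single_value)
print(last_column)
print(middle_rows)
```

The first term inside the brackets always addresses rows and the second addresses columns.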

In [None]:
# how would we obtain the first row's fifth column using iloc? (Remember they're both zero indexed.)
ufo_data.iloc[]

In [None]:
# how would you return all rows and only the latitude and longitude columns?
#(they're the last two columns in the DataFrame.)
ufo_data.iloc[]

In [None]:
# can you return only the tenth through fifteenth rows and all columns?
ufo_data.iloc[]

## Feature Engineering Our Shape Predictor

A common ML task is combining or otherwise manipulating multiple features into a single feature, part of a process called 'feature engineering.'

Below, observe how we turn the two features 'city' and 'state' into a single feature using basic concatenation.

In [None]:
ufo_data["city and state"] = ufo_data.loc[:,"city"] + ", " + ufo_data.loc[:,"state"]
ufo_data["city and state"].head()
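
One caveat with `+` concatenation: if either value is missing, the combined value is missing too. The sketch below shows this on made-up rows (the `places` DataFrame is hypothetical). Filling gaps with `fillna` first is one possible workaround, and not necessarily the right choice for the UFO data.

```python
import pandas as pd

# Hypothetical data for illustration only
places = pd.DataFrame({
    "city": ["Roswell", "Phoenix", None],
    "state": ["NM", None, "TX"],
})

# missing values propagate through string concatenation
combined = places["city"] + ", " + places["state"]
print(combined)

# one option: replace missing values with a placeholder before combining
filled = places["city"].fillna("unknown") + ", " + places["state"].fillna("unknown")
print(filled)
```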

## Preparing Data for Machine Learning

In [None]:
# make a new dataframe called "shape_prediction_data" with three columns in this order:
#  "city and state"
#  "duration"
#  "shape"
shape_prediction_data =

shape_prediction_data.head()
shape_prediction_data.info()

Separate our `shape_prediction_data` into features (`city and state` and `duration`) and labels (`shape`) and assign them to `X` and `y` respectively.

In [None]:
# assuming "shape" is our label and "city and state" and "duration" are our features,
# can you fill in the following two variables?
X = shape_prediction_data.iloc['just the x column']
y = shape_prediction_data.iloc['just the y column']

print("feature set:\n", X.head())
print("label:\n", y.head())
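
The same kind of features/label split can be sketched on a made-up DataFrame; the `study` data and its column names are hypothetical. Here the features are every column except the last, and the label is the last column.

```python
import pandas as pd

# Hypothetical data for illustration only
study = pd.DataFrame({
    "hours_studied": [1, 4, 6],
    "hours_slept": [8, 6, 7],
    "passed": [False, True, True],
})

X_toy = study.iloc[:, :-1]  # all rows, every column except the last (a DataFrame)
y_toy = study.iloc[:, -1]   # all rows, last column only (a Series)

print(X_toy.columns.tolist())
print(y_toy.name)
```

Slicing by position like this only works when the label is the final column, which is why the column order chosen for `shape_prediction_data` matters.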