# "Hello World of Machine Learning"
> "This should be the first step if you're new to Machine Learning. We'll also load and understand our data."

- toc: true
- branch: master
- badges: true
- comments: true
- author: Rishiraj Acharya
- categories: [machine learning, data exploration]
- image: images/nb/decisiontree.png
- hide: false
- search_exclude: false

# Introduction

We will start at the very beginning: what exactly is “machine learning”, and how is it used in the real world? We’ll learn the answers to these questions and explore the basics of decision trees, as we start to build a strong foundation for some of the most cutting-edge techniques in data science.

We’ll learn all about pandas, the primary tool used by data scientists for exploring and manipulating data. Then, we’ll use our new knowledge to examine a dataset of Chicken Tikka Masalas.

# Who's ML? I'll do you one better, why is ML?

Let's pretend for a moment that there's a Chicken Tikka Masala cooking competition, and we want to come up with a recipe that will earn the highest rating on a continuous 10-point scale. One approach would be to use the recipe we have on hand and hope for the best. But what if we put our newly acquired machine learning skills to the test instead? We can do this by finding hundreds, thousands, or even millions of different Chicken Tikka Masala recipes and listing their ingredients and competition ratings in a table like the one below, where each row is a recipe and each column holds either the rating or the amount of one ingredient in that recipe.

In [7]:
tikkas

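|   | Rating | Chicken | Ginger | Garlic | Onion | Tomato | Garam Masala | Salt | Cream |
|---|--------|---------|--------|--------|-------|--------|--------------|------|-------|
| 0 | 8.5 | 2.25 | 2 | 2 | 2.0 | 4 | 1 | 1.0 | 2 |
| 1 | 1.0 | 1.5 | 6 | 1 | 1.0 | 0 | 0 | 12.0 | 0 |
| 2 | 6.3 | 2.0 | 2 | 3 | 1.5 | 2 | 1 | 0.5 | 1 |
| 3 | 5.0 | 4.5 | 1 | 5 | 5.0 | 3 | 1 | 3.0 | 0 |
| 4 | 8.0 | 2.0 | 2 | 0 | 1.5 | 3 | 2 | 1.0 | 3 |
| 5 | 7.0 | 2.0 | 1 | 1 | 0.5 | 4 | 1 | 0.0 | 4 |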


Then we could fit a decision tree model to this dataset and use it to help predict the best combination of ingredients in order to make the highest rated Chicken Tikka Masala. What we're hoping to accomplish by doing this is to use the model we build to uncover or explain the relationship between the inputs (Chicken Tikka Masala ingredients) and the output (Chicken Tikka Masala rating). A common pattern that you're going to see is that we:


1.   Define a model
2.   Fit a model
3.   Make predictions
4.   Validate our model


When we define a model, we ask ourselves, "What model would I like to use?" As you develop your machine learning skills, you'll have a wide array of models to choose from, and you'll start to develop a sense of which models are best suited to a given task. In general terms, we can represent defining a model with code like `my_model = ModelName()`.

Fitting a model is another way of saying training a model. We take the model we've defined, apply it to our dataset, and ask it to start pulling out the underlying patterns in the data. An abstract code snippet for fitting a model might look like `my_model.fit(features, target)`.

Making predictions is what happens once we've built a model and want to apply it to data it has never seen before. I know it's kind of silly to talk about what a model "sees", since technically it's a piece of code and doesn't have eyes, but just roll with me on this one. A general way to write making predictions would be something like `my_model.predict(data)`. Validating our model, the fourth step, then means checking how closely those predictions match the actual ratings.
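
To make the define, fit, predict, validate pattern a bit more concrete, here is a minimal sketch. It assumes scikit-learn's `DecisionTreeRegressor` as the decision tree (this post doesn't name a library, so that choice is mine), and it types in the six recipes from the table above by hand. The `mean_absolute_error` check is just one possible way to validate.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# The six recipes from the table above, typed in by hand.
tikkas = pd.DataFrame({
    "Rating": [8.5, 1.0, 6.3, 5.0, 8.0, 7.0],
    "Chicken": [2.25, 1.5, 2.0, 4.5, 2.0, 2.0],
    "Ginger": [2, 6, 2, 1, 2, 1],
    "Garlic": [2, 1, 3, 5, 0, 1],
    "Onion": [2.0, 1.0, 1.5, 5.0, 1.5, 0.5],
    "Tomato": [4, 0, 2, 3, 3, 4],
    "Garam Masala": [1, 0, 1, 1, 2, 1],
    "Salt": [1.0, 12.0, 0.5, 3.0, 1.0, 0.0],
    "Cream": [2, 0, 1, 0, 3, 4],
})

features = tikkas.drop(columns="Rating")  # inputs: the ingredient amounts
target = tikkas["Rating"]                 # output: the competition rating

# 1. Define a model (assumption: a scikit-learn regression tree).
my_model = DecisionTreeRegressor(random_state=0)

# 2. Fit the model to our data.
my_model.fit(features, target)

# 3. Make predictions (here on the same recipes the model was fitted on).
predictions = my_model.predict(features)

# 4. Validate: average distance between predicted and actual ratings.
mae = mean_absolute_error(target, predictions)
print(predictions)
print(mae)
```

Keep in mind that scoring a model on the very recipes it was fitted on is overly optimistic. With only six rows an unrestricted tree can simply memorize them, so a real validation would use recipes the model has not seen.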

# Data Exploration

It's always good to know what you're getting into before you start something, and one of the best ways to find out is to ask a lot of questions. So let's think for a minute about our Chicken Tikka Masala problem: we're trying to map a Chicken Tikka Masala recipe to its rating. Hypothetically, we can go out and collect all of this recipe data and organize it into something called a dataframe. Once our data is organized into a dataframe, we can use some Python code to ask and answer various questions about it.

You can ask an almost infinite number of questions, but some of my favorite starting questions are: "What are the names of the variables in my data?", "How many variables are in my dataframe?", and "How many observations, or rows, are in my dataframe?" I can also ask, "What do the first few rows of my dataframe look like?" and "Are there any missing values in my dataframe?" I can even ask, "What's the average value of a numerical variable within my dataframe?"

Let's try answering some of these questions on our Chicken Tikka Masala dataset using the pandas library. For example, if we want to know how many observations and variables (rows and columns) are in our dataframe, we can use `tikkas.shape` to see that we have 6 rows and 9 columns.

In [1]:
import pandas as pd
tikkas = pd.read_csv("/content/tikka.csv")

In [2]:
# Question: how many rows and columns in tikkas?
tikkas.shape

# There are 6 rows and 9 columns in the tikkas dataframe

(6, 9)
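
If you don't have `tikka.csv` on hand, here is a rough way to follow along. This sketch feeds the same six recipes to `pd.read_csv` as inline text through `io.StringIO`. The exact layout of the real file is an assumption on my part (a header row, no index column), so treat `csv_text` as a stand-in.

```python
import io

import pandas as pd

# Stand-in for tikka.csv: assumed layout, values copied from the table shown earlier.
csv_text = """Rating,Chicken,Ginger,Garlic,Onion,Tomato,Garam Masala,Salt,Cream
8.5,2.25,2,2,2.0,4,1,1.0,2
1.0,1.5,6,1,1.0,0,0,12.0,0
6.3,2.0,2,3,1.5,2,1,0.5,1
5.0,4.5,1,5,5.0,3,1,3.0,0
8.0,2.0,2,0,1.5,3,2,1.0,3
7.0,2.0,1,1,0.5,4,1,0.0,4
"""

# read_csv accepts a file-like object as well as a path.
tikkas = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = tikkas.shape
print(f"{n_rows} rows and {n_cols} columns")
```

Unpacking `shape` into two names is handy when you want to use the row count or column count later on.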

To look at the names of the variables in our tikkas dataframe we can run `tikkas.columns`.

In [3]:
# Question: what are the names of my variables (columns)?
tikkas.columns

Index(['Rating', 'Chicken', 'Ginger', 'Garlic', 'Onion', 'Tomato',
       'Garam Masala', 'Salt', 'Cream'],
      dtype='object')

To look at the first few rows of our dataframe, we can use `tikkas.head()`, which shows the first five rows by default.

In [4]:
# Question: what do the first few rows of my dataframe contain?
tikkas.head()

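|   | Rating | Chicken | Ginger | Garlic | Onion | Tomato | Garam Masala | Salt | Cream |
|---|--------|---------|--------|--------|-------|--------|--------------|------|-------|
| 0 | 8.5 | 2.25 | 2 | 2 | 2.0 | 4 | 1 | 1.0 | 2 |
| 1 | 1.0 | 1.5 | 6 | 1 | 1.0 | 0 | 0 | 12.0 | 0 |
| 2 | 6.3 | 2.0 | 2 | 3 | 1.5 | 2 | 1 | 0.5 | 1 |
| 3 | 5.0 | 4.5 | 1 | 5 | 5.0 | 3 | 1 | 3.0 | 0 |
| 4 | 8.0 | 2.0 | 2 | 0 | 1.5 | 3 | 2 | 1.0 | 3 |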


What if there's missing data? We can check by running `tikkas.isnull().sum()`, which counts the missing values in each column.

In [5]:
# Question: are there missing values in my dataframe?
tikkas.isnull().sum()

# There are no missing values in the tikkas dataframe

Rating          0
Chicken         0
Ginger          0
Garlic          0
Onion           0
Tomato          0
Garam Masala    0
Salt            0
Cream           0
dtype: int64
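
Our dataset happens to be complete, so every count is zero. To see what the same call reports when something is missing, here is a small sketch on a made-up three-row frame called `demo`, where I deliberately blank out one `Salt` value. The gap is invented for illustration and is not part of the tikkas data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: the NaN in Salt is planted on purpose.
demo = pd.DataFrame({
    "Rating": [8.5, 1.0, 6.3],
    "Salt": [1.0, np.nan, 0.5],
    "Cream": [2, 0, 1],
})

# isnull() marks each cell True/False; sum() then counts the True values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Here `Salt` reports one missing value while `Rating` and `Cream` report zero.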

We can also get a table of summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numerical column by using `tikkas.describe()`.

In [6]:
# Question: what are the summary statistics for my dataframe?
tikkas.describe()

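|   | Rating | Chicken | Ginger | Garlic | Onion | Tomato | Garam Masala | Salt | Cream |
|---|--------|---------|--------|--------|-------|--------|--------------|------|-------|
| count | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 5.966667 | 2.375 | 2.333333 | 2.0 | 1.916667 | 2.666667 | 1.0 | 2.916667 | 1.666667 |
| std | 2.73252 | 1.069462 | 1.861899 | 1.788854 | 1.594261 | 1.505545 | 0.632456 | 4.565267 | 1.632993 |
| min | 1.0 | 1.5 | 1.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5.325 | 2.0 | 1.25 | 1.0 | 1.125 | 2.25 | 1.0 | 0.625 | 0.25 |
| 50% | 6.65 | 2.0 | 2.0 | 1.5 | 1.5 | 3.0 | 1.0 | 1.0 | 1.5 |
| 75% | 7.75 | 2.1875 | 2.0 | 2.75 | 1.875 | 3.75 | 1.0 | 2.5 | 2.75 |
| max | 8.5 | 4.5 | 6.0 | 5.0 | 5.0 | 4.0 | 2.0 | 12.0 | 4.0 |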
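
If we only care about one number, such as the average rating, we don't need the whole summary table. The sketch below pulls single statistics straight from a column. To keep it short it rebuilds just the `Rating` and `Salt` columns by hand, so the small frame here is a stand-in rather than the full dataset.

```python
import pandas as pd

# Two columns copied from the tikkas table; the other ingredients are left out for brevity.
tikkas = pd.DataFrame({
    "Rating": [8.5, 1.0, 6.3, 5.0, 8.0, 7.0],
    "Salt": [1.0, 12.0, 0.5, 3.0, 1.0, 0.0],
})

# Average value of one numerical column (about 5.97, matching the mean row above).
mean_rating = tikkas["Rating"].mean()
print(mean_rating)

# idxmax() returns the row label of the largest value in a column.
best_recipe = tikkas["Rating"].idxmax()
saltiest_recipe = tikkas["Salt"].idxmax()
print(best_recipe, saltiest_recipe)
```

In this data the saltiest recipe (row 1) is also the lowest rated, which is the kind of pattern we'd hope a model could pick up on.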


We've covered a ton of information about basic data exploration. I hope you've learned something in this blog post. I know I have, and I've really enjoyed learning with you. So when you're ready, meet me in the next post, where we're going to build our first machine learning model.