In [None]:
import pandas as pd
from sklearn.datasets import load_iris

## Pandas Basics

Pandas is one of the most popular data analysis libraries for Python. Its design borrows ideas from SQL and R, and it has continued to add support for newer hardware and deployment setups (parallel, in-memory, cloud, ...) as well as advanced analysis capabilities.

The fundamental object we'll be using is the DataFrame. This is basically just a table, but with a lot of built-in, powerful data analysis methods.
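
As a rough illustration of "a table with methods", here is a minimal sketch using a tiny DataFrame. The column names and values are made up for this example and are not part of the iris dataset.

```python
import pandas as pd

# Hypothetical measurements, made up purely for illustration
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "width": [3.5, 3.0, 3.3],
})

print(df.shape)                # (rows, columns)
print(df["length"].mean())     # summary statistic for one column
print(df[df["width"] > 3.1])   # boolean filtering keeps only matching rows
```

The same methods work unchanged on much larger tables, which is most of the appeal.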

In [None]:
# load the dataset (built-in to scikit-learn)
iris = load_iris()

# create a DataFrame of the dataset
ir = pd.DataFrame(iris.data)
# set column names
ir.columns = iris.feature_names
# add species information
ir['species'] = iris.target


# look at the head of the dataset
ir.head()

In [None]:
# fix the column names: replace spaces with underscores and drop the "(cm)" unit suffix
ir.columns = [x.replace(" ", "_").replace("_(cm)", "") for x in ir.columns]
ir.head()

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

# write a small function to decode the names
def iris_decoder(species_code):
    if species_code == 0:
        return "Setosa"
    elif species_code == 1:
        return "Versicolor"
    else:
        return "Virginica"


In [None]:
# apply the decoder to every species code (the function can be passed directly, no lambda needed)
ir['species_name'] = ir['species'].apply(iris_decoder)
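
An alternative to a hand-written decoder is a dictionary lookup with `Series.map`. The sketch below builds the lookup from `iris.target_names` and reloads the dataset so it can run on its own. The names `code_to_name`, `species`, and `species_name` are chosen just for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build the code -> name lookup from the dataset itself,
# capitalizing each name to match the decoder function's output
code_to_name = {
    code: str(name).capitalize()
    for code, name in enumerate(iris.target_names)
}

species = pd.Series(iris.target, name="species")

# map() replaces each code with its name; a code missing from the
# dictionary would become NaN rather than silently getting a default label
species_name = species.map(code_to_name)

print(species_name.value_counts())
```

One design difference worth noting: the `else` branch in the function labels any unexpected code as "Virginica", while the dictionary approach surfaces unexpected codes as missing values.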

In [None]:
ir.head()

## Exploratory Data Analysis (EDA)

As a data scientist, you might know a lot about programming and statistics and have an area of specialty, but you are often asked to use your skills to solve a problem outside of your domain. One of the key skills you need to develop is the ability to explore a dataset so you can get more context about a particular domain. I'm guessing most of us don't know much about flowers or botany, so we're going to see what we can learn from the iris dataset!

In [None]:
# get summary statistics for each column in the dataset
# note that there is no missing data: the count row shows 150 for every column
ir.describe()
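
The "no missing data" observation can also be checked directly rather than read off the `count` row. This is a minimal sketch that rebuilds the feature table so it runs on its own. The names `df` and `missing_per_column` are chosen for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# isna() marks each missing cell as True; sum() then counts them per column
missing_per_column = df.isna().sum()

print(missing_per_column)
print(df.shape)  # number of rows and feature columns
```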

In [None]:
# mean of each feature for each group
ir.groupby("species_name").mean()

In [None]:
# standard deviation of each feature for each group
ir.groupby("species_name").std()

In [None]:
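# how correlated are our variables?
# (keep only numeric columns; the text column species_name cannot be correlated)
ir.select_dtypes("number").corr()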

## Data Visualization with Seaborn

In [None]:
import seaborn as sns

In [None]:
# correlate the four measurements only: drop the species code and the text label
corr = ir.drop(columns=["species", "species_name"]).corr()

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with square cells and a color scale centered at zero
sns.heatmap(corr, cmap=cmap, vmax=1.0, center=0.0,
            square=True, linewidths=.1, cbar_kws={"shrink": .8})

In [None]:
sns.boxplot(data=ir, x="species_name", y="sepal_length")

In [None]:
sns.boxplot(data=ir, x="species_name", y="sepal_width")

In [None]:
sns.scatterplot(x="sepal_length", y="sepal_width", hue="species_name", data=ir)

In [None]:
sns.scatterplot(x="petal_length", y="petal_width", hue="species_name", data=ir)

So now that we have an idea of what the data looks like, let's try to build a model! The most important part of being a professional data scientist is to make sure your model is solving the right problem. Here we can imagine someone discovering a new flower and not knowing what species it is. We can build a model that can predict the species given the measurements of the flower!

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
x_vars = ir[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
target = ir["species"]

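# with lbfgs, multinomial (softmax) regression is already the default for multiclass targets;
# the explicit multi_class argument is deprecated in recent scikit-learn releases
clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=500).fit(x_vars, target)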

# how did we do? (accuracy measured on the same data we trained on)
print(clf.score(x_vars, target))

In [None]:
# K-means, KNN

In [None]:
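# what mistake did we make? we scored the model on the same data it was trained on,
# so the score cannot tell us whether the model is overfitting!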
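
One common way to get a more honest estimate is to hold out part of the data for testing. The sketch below assumes a 70/30 stratified split. The split size, the `random_state` values, and the variable names are choices made for this example, not something the notebook prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows (an assumed split size);
# stratify=y keeps the three species equally represented in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit on the training rows only
clf = LogisticRegression(random_state=0, max_iter=500).fit(X_train, y_train)

# Compare accuracy on data the model has seen vs. data it has not
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers would suggest overfitting. The held-out score is the one to report.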
