# DATA SCIENTIST: NATURAL LANGUAGE PROCESSING SPECIALIST

## Find the flag!

Can you guess which continent this flag comes from?

## Flag of Reunion

What are some of the features that would clue you in? Maybe some of the colors are good indicators. The presence or absence of certain shapes could give you a hint. In this project, we’ll use decision trees to try to predict the continent of flags based on several of these features.

We'll explore which features are the best to use and the best way to create your decision tree.

### Datasets

The original data set is available at the [UCI Machine Learning Repository][uci].

## Tasks

[uci]: https://archive.ics.uci.edu/ml/datasets/Flags

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

### Investigate the data

1. The dataset has been loaded for you in script.py and saved as a dataframe named df. Some of the input and output features of interest are:

    - **name**: Name of the country concerned
    - **landmass**: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
    - **bars**: Number of vertical bars in the flag
    - **stripes**: Number of horizontal stripes in the flag
    - **colours**: Number of different colours in the flag
    - **red**: 0 if red absent, 1 if red present in the flag
    - **mainhue**: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
    - **circles**: Number of circles in the flag
    - **crosses**: Number of (upright) crosses
    - **saltires**: Number of diagonal crosses
    - **quarters**: Number of quartered sections
    - **sunstars**: Number of sun or star symbols

    We will build a decision tree classifier to predict which continent a particular flag comes from. Before that, we want to understand the distribution of flags by continent. Calculate the count of flags by landmass value.

In [2]:
#https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data
cols = ['name','landmass','zone', 'area', 'population', 'language','religion','bars','stripes','colours',
'red','green','blue','gold','white','black','orange','mainhue','circles',
'crosses','saltires','quarters','sunstars','crescent','triangle','icon','animate','text','topleft','botright']
df= pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data", names = cols)

#variable names to use as predictors
var = [ 'red', 'green', 'blue','gold', 'white', 'black', 'orange', 'mainhue','bars','stripes', 'circles','crosses', 'saltires','quarters','sunstars','triangle','animate']

#Print number of countries by landmass, or continent
print(df.landmass.value_counts())

landmass
4    52
5    39
3    35
1    31
6    20
2    17
Name: count, dtype: int64
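
The numeric codes in the output are easier to read once mapped to the continent names from the feature list above. The following is a minimal sketch of that mapping; the `toy` dataframe and the `landmass_names` dictionary are illustrative assumptions, not part of the project files.

```python
import pandas as pd

# Landmass codes as given in the feature description above
landmass_names = {1: "N.America", 2: "S.America", 3: "Europe",
                  4: "Africa", 5: "Asia", 6: "Oceania"}

# Hypothetical stand-in for df["landmass"] (values are made up)
toy = pd.DataFrame({"landmass": [3, 6, 3, 4, 4, 4]})

# Replace each code with its name, then count rows per continent
counts = toy["landmass"].map(landmass_names).value_counts()
print(counts)
```

With the real data, the same `map` call could be applied to `df["landmass"]` before calling `value_counts()`.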


2. Rather than looking at all six continents, we will focus on just two, Europe and Oceania. Create a new dataframe with only flags from Europe and Oceania.


In [3]:
#Create a new dataframe with only flags from Europe and Oceania
df_36 = df[df["landmass"].isin([3,6])]


3. Using the predictors listed in `var`, print the average value of each for these two continents. Note which predictors have very different averages.

In [4]:
#Print the average values of the numeric predictors for Europe and Oceania
#(numeric_only=True skips the text column mainhue, which cannot be averaged)
print(df_36.groupby('landmass')[var].mean(numeric_only=True).T)
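
One way to spot the predictors with very different averages is to rank them by the gap between the two group means. The sketch below uses a small made-up dataframe (`toy`) in place of `df_36`, so the numbers it prints say nothing about the real flags.

```python
import pandas as pd

# Hypothetical toy data standing in for df_36 (values are made up)
toy = pd.DataFrame({
    "landmass": [3, 3, 3, 6, 6],
    "red":      [1, 1, 0, 1, 1],
    "sunstars": [0, 0, 1, 4, 5],
    "mainhue":  ["red", "blue", "white", "blue", "blue"],
})

# Average only the numeric predictors; the text column mainhue is skipped
means = toy.groupby("landmass").mean(numeric_only=True).T

# Absolute gap between the Europe (3) and Oceania (6) averages, largest first
gap = (means[3] - means[6]).abs().sort_values(ascending=False)
print(gap)
```

Predictors at the top of `gap` are plausible candidates for early splits in the tree, though a large gap in means alone does not guarantee a useful split.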

4. We will build a classifier to distinguish flags for these two continents, but first inspect the variable types of each predictor. Create the labels from `df_36` so that they line up row-for-row with the predictors, using 1 for Europe and 0 for Oceania:

    ```python
    labels = (df_36["landmass"].isin([3]))*1
    ```

In [None]:
#Create labels for only Europe and Oceania (1 = Europe, 0 = Oceania)
labels = (df_36["landmass"].isin([3]))*1

#Print the variable types for the predictors
print(df[var].dtypes)

5. Note that all the predictor variables are numeric except for mainhue. Transform the dataset of predictor variables to dummy variables and save this in a new dataframe called data.

In [None]:
#Create dummy variables for categorical predictors
data = pd.get_dummies(df_36[var])
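
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up two-column dataframe (`toy` is an assumption for illustration). Numeric columns pass through unchanged, while each distinct value of the text column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy predictors: one numeric column and one text column
toy = pd.DataFrame({"stripes": [3, 0, 5], "mainhue": ["red", "blue", "red"]})

# Text columns are expanded into one indicator column per category
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
# ['stripes', 'mainhue_blue', 'mainhue_red']
```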

6. Split the data into a train and test set.

In [None]:
#Split data into a train and test set
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1, test_size=.4)

### Tune Decision Tree Classifiers by Depth

7. We will explore tuning the decision tree model by testing the performance over a range of `max_depth` values. Fit a decision tree classifier for `max_depth` values from 1-20. Save the accuracy score for each depth in the list `acc_depth`.


In [None]:
#Fit a decision tree for max_depth values 1-20; save the accuracy score in acc_depth
depths = range(1, 21)
acc_depth = []
for i in depths:
    dt = DecisionTreeClassifier(random_state = 10, max_depth = i)
    dt.fit(train_data, train_labels)
    acc_depth.append(dt.score(test_data, test_labels))

8. Plot the accuracy of the decision tree models versus the `max_depth`.

In [None]:
#Plot the accuracy vs depth
plt.plot(depths, acc_depth)
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.show()

9. Find the largest accuracy and the depth at which it occurs.


In [None]:
#Find the largest accuracy and the depth at which it occurs
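
One possible way to do this is with `np.argmax`, which returns the position of the first maximum. The accuracy values below are made up for illustration; in the project they would come from the fitted trees.

```python
import numpy as np

depths = range(1, 21)
# Hypothetical accuracies, one per depth (made-up numbers for illustration)
acc_depth = [0.60, 0.68, 0.73, 0.73] + [0.70] * 16

# Index of the first (shallowest) depth that reaches the maximum accuracy
best_idx = int(np.argmax(acc_depth))
best_depth = depths[best_idx]
print(best_depth, acc_depth[best_idx])
```

When several depths tie, this picks the shallowest one, which is usually the simpler tree to interpret.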




10. Refit the decision tree model using the `max_depth` from above; plot the decision tree.

In [None]:
#Refit decision tree model with the highest accuracy and plot the decision tree
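
Besides `tree.plot_tree`, the fitted rules can also be printed as text with `sklearn.tree.export_text`, which can be handy when a figure is hard to read. The sketch below runs on synthetic data from `make_classification`; the feature names and the depth of 2 are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the flag data (not the real dataset)
X, y = make_classification(n_samples=60, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Refit at a chosen depth (2 here is an arbitrary illustrative value)
dt = DecisionTreeClassifier(random_state=10, max_depth=2).fit(X, y)

# Text rendering of the split rules and leaf classes
rules = export_text(dt, feature_names=names)
print(rules)
```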



### Tune Decision Tree Classifiers by Pruning

11. Like we did with `max_depth`, we will now tune the tree using the hyperparameter `ccp_alpha`, which is a pruning parameter. Fit a decision tree classifier for each value in `ccp`. Save the accuracy scores in the list `acc_pruned`.

In [None]:
# Create a new list for the accuracy values of a pruned decision tree.
# Loop through the values of ccp and append the scores to the list



12. Plot the accuracy of the decision tree models versus the `ccp_alpha`.

In [None]:
#Plot the accuracy vs ccp_alpha



13. Find the largest accuracy and the `ccp_alpha` value at which it occurs.

In [None]:
#Find the largest accuracy and the ccp value at which it occurs
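
The candidate values for `ccp_alpha` have to come from somewhere. One option, sketched below, is to let scikit-learn compute the effective alphas with `cost_complexity_pruning_path` and then score one tree per alpha. This is an assumption about how `ccp` might be built, run on synthetic data rather than the flags, and not necessarily the grid the project intends.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flag data (not the real dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, test_size=0.4)

# Effective alphas at which the fully grown tree loses a subtree
path = DecisionTreeClassifier(random_state=10).cost_complexity_pruning_path(X_tr, y_tr)
ccp = np.clip(path.ccp_alphas, 0, None)  # guard against tiny negative round-off

# Fit one pruned tree per alpha and record its test accuracy
acc_pruned = []
for alpha in ccp:
    dt = DecisionTreeClassifier(random_state=10, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc_pruned.append(dt.score(X_te, y_te))

# First alpha that reaches the largest accuracy
best_alpha = ccp[acc_pruned.index(max(acc_pruned))]
print(max(acc_pruned), best_alpha)
```

A fixed grid of alphas would work as well; the pruning path simply avoids testing values that produce identical trees.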



14. Fit a decision tree model with the values for `max_depth` and `ccp_alpha` found above. Plot the final decision tree.

In [None]:
#Fit a decision tree model with the values for max_depth and ccp_alpha found above


#Plot the final decision tree

15. Nice work! Note that the accuracy of our final model increased and the structure of the tree was simpler – many unnecessary branches were removed in the pruning process making for a much easier interpretation.

    There are a few different ways to extend this project:

    - Try to classify something else! Rather than predicting the "Landmass" feature, you could predict something like "Language".

    - Find a subset of features that works better than what we’re currently using. An important note is that a categorical feature stored as integer codes won’t work very well as is, because the codes have no meaningful order. For example, we don't want a decision node to split on whether the value for "Language" is above or below 5.

    - Tune more parameters of the model. You can find a description of all the parameters you can tune in the Decision Tree Classifier documentation. For example, see what happens if you tune `max_leaf_nodes`. Think about whether you would be overfitting or underfitting the data based on how many leaf nodes you allow.
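
As a starting point for the `max_leaf_nodes` experiment, the sketch below compares train and test accuracy for a few leaf budgets. It uses synthetic data and an arbitrary list of budgets, both of which are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flag data (not the real dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, test_size=0.4)

# Each entry: (leaf budget, leaves actually used, train accuracy, test accuracy)
results = []
for n in [2, 4, 8, 16, 32]:  # arbitrary illustrative budgets
    dt = DecisionTreeClassifier(random_state=10, max_leaf_nodes=n).fit(X_tr, y_tr)
    results.append((n, dt.get_n_leaves(), dt.score(X_tr, y_tr), dt.score(X_te, y_te)))

for row in results:
    print(row)
```

A widening gap between train and test accuracy as the budget grows would suggest overfitting, while low accuracy on both at small budgets would suggest underfitting.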


In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Load the data and define the predictors
cols = ['name','landmass','zone', 'area', 'population', 'language','religion','bars','stripes','colours',
        'red','green','blue','gold','white','black','orange','mainhue','circles',
        'crosses','saltires','quarters','sunstars','crescent','triangle','icon','animate','text','topleft','botright']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data", names=cols)

var = ['red', 'green', 'blue','gold', 'white', 'black', 'orange', 'mainhue','bars','stripes',
       'circles','crosses', 'saltires','quarters','sunstars','triangle','animate']

# Print number of countries by landmass
print(df.landmass.value_counts())

# Create a new dataframe with only flags from Europe and Oceania
df_36 = df[df["landmass"].isin([3, 6])]

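# Print the average values of the numeric predictors for Europe and Oceania
# (numeric_only=True skips the text column mainhue, which cannot be averaged)
print(df_36.groupby('landmass')[var].mean(numeric_only=True).T)

# Create labels from df_36 so they match the rows of the predictors
# (1 = Europe, 0 = Oceania)
labels = (df_36["landmass"].isin([3])) * 1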

# Print the variable types for the predictors
print(df[var].dtypes)

# Create dummy variables for categorical predictors
data = pd.get_dummies(df_36[var])

# Split data into a train and test set
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1, test_size=.4)

# Fit a decision tree for max_depth values 1-20; save the accuracy score in acc_depth
depths = range(1, 21)
acc_depth = []
for i in depths:
    dt = DecisionTreeClassifier(random_state=10, max_depth=i)
    dt.fit(train_data, train_labels)
    acc_depth.append(dt.score(test_data, test_labels))

# Plot the accuracy of the decision tree models versus the max_depth
plt.plot(depths, acc_depth)
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.title("Accuracy vs max_depth")
plt.show()

# Find the largest accuracy and the depth at which it occurs
max_accuracy = max(acc_depth)
best_depth = depths[acc_depth.index(max_accuracy)]
print("Best accuracy:", max_accuracy)
print("Best depth:", best_depth)

# Refit decision tree model with the highest accuracy and plot the decision tree
dt_best = DecisionTreeClassifier(random_state=10, max_depth=best_depth)
dt_best.fit(train_data, train_labels)

plt.figure(figsize=(10, 8))
# The data holds only Europe and Oceania flags, so class 0 = Oceania and class 1 = Europe
tree.plot_tree(dt_best, feature_names=list(data.columns), class_names=["Oceania", "Europe"], filled=True)
plt.show()

# Create a new list for the accuracy values of a pruned decision tree
acc_pruned = []

# Loop through the values of ccp_alpha and append the scores to the list
ccp_alphas = np.linspace(0, 0.1, 100)
for alpha in ccp_alphas:
    dt = DecisionTreeClassifier(random_state=10, ccp_alpha=alpha)
    dt.fit(train_data, train_labels)
    acc_pruned.append(dt.score(test_data, test_labels))

# Plot the accuracy of the decision tree models versus the ccp_alpha
plt.plot(ccp_alphas, acc_pruned)
plt.xlabel("ccp_alpha")
plt.ylabel("Accuracy")
plt.title("Accuracy vs ccp_alpha")
plt.show()

# Find the largest accuracy and the ccp_alpha value at which it occurs
max_accuracy = max(acc_pruned)
best_alpha = ccp_alphas[acc_pruned.index(max_accuracy)]
print("Best accuracy:", max_accuracy)
print("Best ccp_alpha:", best_alpha)

# Fit a decision tree model with the values for max_depth and ccp_alpha found above
dt_final = DecisionTreeClassifier(random_state=10, max_depth=best_depth, ccp_alpha=best_alpha)
dt_final.fit(train_data, train_labels)

# Plot the final decision tree
plt.figure(figsize=(10, 8))
# The data holds only Europe and Oceania flags, so class 0 = Oceania and class 1 = Europe
tree.plot_tree(dt_final, feature_names=list(data.columns), class_names=["Oceania", "Europe"], filled=True)
plt.show()
