# Tutorial (TD) 2022_11_17

## Predicting the 2016 US Vote with Decision Trees and Ensemble Methods

Today's session is about predicting the 2016 vote in the United States. Specifically, census data are provided with various pieces of information for each county across the United States. The goal is to build predictors of each county's political color:
- *Republican:* **red**;
- *Democrat:* **blue**;

Run the following commands to load the environment.

In [2]:
%matplotlib inline
from pylab import *
import numpy as np
import os
import random
import matplotlib.pyplot as plt

# Data Access

* The data are available on eCampus or on TEAMS
* Upload the file combined_data.csv to your Drive, then mount the Drive from Colab


In [3]:
USE_COLAB = True
UPLOAD_OUTPUTS = False
if USE_COLAB:
    # mount the google drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    # directory holding the data on Google Drive
    data_dir = "/content/drive/My Drive/ENSTA/MI201/2022_11_17/"
else:
    data_dir = "data/"

Mounted at /content/drive


In [4]:
import pandas as pd

census_data = pd.read_csv( os.path.join(data_dir, 'combined_data.csv') )

# Preliminary Data Analysis

The data are organized into fields:
* fips = 5-digit county code:
 * `00xxx`: the two leading digits are the state code;

* votes = number of voters

* etc... 
 * `census_data.columns.values` returns the column names of the data

Look at their structure, quantity and nature.

1. Where is the information needed to build the training and test sets?
 * `census_data`: 
    1. `y = census_data['Democrat']`: the resulting vote;
    2. `X = census_data.loc[:, 'votes':'voter_turnout_rate']`:
      * `fips` is left out, since location should not influence the prediction 

2. Where are the classes to predict?
 * `'Democrat'` is the desired outcome;

3. Visualize a few distributions.
 * `todo`
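
A first numerical look at item 3 could be sketched as follows. The small DataFrame is made-up data standing in for `census_data`: only the column names `fips`, `votes`, `Unemployment_rate_2015`, `voter_turnout_rate` and `Democrat` come from the real file, and every value is invented. The sketch selects `X` and `y` as in item 1, then compares one attribute across the two classes.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for census_data: real column names, invented values
toy = pd.DataFrame({
    "fips": [1001, 1003, 6001, 6003, 48001, 48003],
    "votes": [100, 250, 900, 40, 300, 120],
    "Unemployment_rate_2015": [5.1, 4.8, 3.9, 7.2, 6.0, 4.1],
    "voter_turnout_rate": [0.41, 0.45, 0.52, 0.38, 0.40, 0.47],
    "Democrat": [0, 0, 1, 0, 0, 1],
})

# Features and target as in item 1 (a label-based slice includes both ends)
X = toy.loc[:, "votes":"voter_turnout_rate"]
y = toy["Democrat"]

# Per-class summary statistics of one attribute
summary = toy.groupby("Democrat")["Unemployment_rate_2015"].describe()
print(summary[["count", "mean", "min", "max"]])

# Histogram counts over shared bins, one row per class
bins = np.linspace(3.0, 8.0, 6)
for label, group in toy.groupby("Democrat"):
    counts, _ = np.histogram(group["Unemployment_rate_2015"], bins=bins)
    print(label, counts)
```

On the real data, the same per-class groups can be passed to `plt.hist` to obtain the graphical version of these counts.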

## Reference
Data Frame:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html


In [5]:
census_data
# census_data.loc[:, 'votes':'voter_turnout_rate']

Unnamed: 0,fips,votes,"Percent of adults with less than a high school diploma, 2011-2015","Percent of adults with a high school diploma only, 2011-2015","Percent of adults completing some college or associate's degree, 2011-2015","Percent of adults with a bachelor's degree or higher, 2011-2015",Unemployment_rate_2015,POP_ESTIMATE_2015,Amish,Buddhist,...,HAWAIIAN_PACIFIC_FEMALE_rate,MULTI_MALE_rate,MULTI_FEMALE_rate,WHITE_rate,BLACK_rate,NATIVE_AMERICAN_rate,HAWAIIAN_PACIFIC_rate,MULTI_rate,voter_turnout_rate,Democrat
0,2013,7471,18.3,39.4,28.2,14.0,3.2,3341.0,0,0,...,0.001131,0.012374,0,0.757510,0.129665,0.011954,0.002299,0.012374,2.236157,0
1,2016,7471,16.0,37.0,32.2,14.7,3.8,5702.0,0,0,...,0.001131,0.012374,0,0.757510,0.129665,0.011954,0.002299,0.012374,1.310242,0
2,2020,7471,7.0,24.1,35.6,33.2,5.0,298695.0,0,13,...,0.001131,0.012374,0,0.757510,0.129665,0.011954,0.002299,0.012374,0.025012,0
3,2050,7471,21.0,43.6,23.7,11.6,14.4,17946.0,0,0,...,0.001131,0.012374,0,0.757510,0.129665,0.011954,0.002299,0.012374,0.416304,0
4,2060,7471,8.0,30.7,41.3,20.0,9.2,892.0,0,0,...,0.001131,0.012374,0,0.757510,0.129665,0.011954,0.002299,0.012374,8.375561,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3140,56037,16661,9.5,35.4,35.5,19.6,4.6,44626.0,0,0,...,0.000829,0.010241,0,0.941536,0.014498,0.012907,0.001972,0.010241,0.373347,0
3141,56039,12176,4.1,17.9,24.1,53.9,3.8,23125.0,0,0,...,0.000605,0.007395,0,0.952216,0.007524,0.010724,0.001341,0.007395,0.526530,1
3142,56041,8053,10.3,36.2,34.2,19.3,4.9,20822.0,0,0,...,0.001105,0.009605,0,0.953991,0.007684,0.012679,0.002497,0.009605,0.386754,0
3143,56043,3715,12.6,29.3,37.0,21.1,4.0,8328.0,1,0,...,0.000720,0.009966,0,0.947406,0.007325,0.017291,0.001321,0.009966,0.446085,0


In [6]:
def dataInfo():
  print(f"Rows       = {census_data.shape[0]}")
  print(f"Attributes = {census_data.shape[1]} total")
  print(f"Attributes = {census_data.shape[1]-2} useful")
  print("\n")
  # 'fips' is excluded: location should not influence the result
  # 'Democrat' is excluded: it is the label, needed only for training / testing
  
  return


def main():
  dataInfo()

  return



main()

Rows       = 3145
Attributes = 78 total
Attributes = 76 useful




The class to predict ('Democrat') is described by a single binary attribute.
Compute the distribution of political colors (what is the prior probability that a county is Democrat vs. Republican?).
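
As a minimal sketch of this computation, assuming the label is coded 1 for Democrat and 0 for Republican: the relative frequency of each class is its empirical prior, and for a 0/1 label the mean equals the share of the positive class. The series below is invented for illustration and stands in for `census_data['Democrat']`.

```python
import pandas as pd

# Made-up labels standing in for census_data['Democrat'] (1 = Democrat)
y = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="Democrat")

# Relative frequency of each class = empirical prior probability
prior = y.value_counts(normalize=True)
print(prior)

# For a 0/1 label, the mean is the share of the positive class
p_democrat = y.mean()
p_republican = 1.0 - p_democrat
print(f"P(Democrat) = {p_democrat:.2%}, P(Republican) = {p_republican:.2%}")
```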

In [7]:
def meanAttr(attribute: str) -> float:
  # for a 0/1 column, the mean is the share of rows equal to 1
  return census_data[attribute].mean()

meanDem = meanAttr('Democrat')


print(f"Probability Democrat: {meanDem*100:.2f} %")

Probability Democrat: 15.45 %


# Preparing the Training

We now prepare the training and test sets. 

To avoid data format problems, we pick a list of useful attributes, stored in the "feature_cols" list below.

The test set will consist of the counties of a single state.
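
Holding out one state can be sketched as below, assuming `fips` is stored as an integer (state code times 1000 plus county code, so the leading zero of codes such as `06xxx` is dropped). The DataFrame is made up for illustration; integer division by 1000 recovers the state code.

```python
import pandas as pd

# Made-up FIPS codes: state code * 1000 + county code
toy = pd.DataFrame({
    "fips": [6001, 6003, 17031, 48001, 48003, 48005],
    "Democrat": [1, 0, 1, 0, 0, 0],
})

state_code = 48  # Texas
in_state = (toy["fips"] // 1000) == state_code

test_part = toy[in_state]    # counties of the held-out state
train_part = toy[~in_state]  # every other county
print(len(train_part), len(test_part))
```

Splitting by state rather than at random means the model is tested on a region it has never seen, which is a harder and more realistic check than a random split.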

## Reference
Info: https://scikit-learn.org/stable/model_selection.html

FIPS:
https://en.wikipedia.org/wiki/federal_Information_Processing_Standard_state_code



In [10]:
feature_cols = ['BLACK_FEMALE_rate', 
                'BLACK_MALE_rate',
                'Percent of adults with a bachelor\'s degree or higher, 2011-2015',
                'ASIAN_MALE_rate',
                'ASIAN_FEMALE_rate',
                '25-29_rate',
                'age_total_pop',
                '20-24_rate',
                'Deep_Pov_All',
                '30-34_rate',
                'Density per square mile of land area - Population',
                'Density per square mile of land area - Housing units',
                'Unemployment_rate_2015',
                'Deep_Pov_Children',
                'PovertyAllAgesPct2014',
                'TOT_FEMALE_rate',
                'PerCapitaInc',
                'MULTI_FEMALE_rate',
                '35-39_rate',
                'MULTI_MALE_rate',
                'Percent of adults completing some college or associate\'s degree, 2011-2015',
                '60-64_rate',
                '55-59_rate',
                '65-69_rate',
                'TOT_MALE_rate',
                '85+_rate',
                '70-74_rate',
                '80-84_rate',
                '75-79_rate',
                'Percent of adults with a high school diploma only, 2011-2015',
                'WHITE_FEMALE_rate',
                'WHITE_MALE_rate',
                'Amish',
                'Buddhist',
                'Catholic',
                'Christian Generic',
                'Eastern Orthodox',
                'Hindu',
                'Jewish',
                'Mainline Christian',
                'Mormon',
                'Muslim',
                'Non-Catholic Christian',
                'Other',
                'Other Christian',
                'Other Misc',
                'Pentecostal / Charismatic',
                'Protestant Denomination',
                'Zoroastrian']
print(len(feature_cols))

49


In [35]:
def county_data(census_data, fips_code=17):
  #fips_code
  # 48=Texas
  # 34=New Jersey
  # 31=Nebraska
  # 17=Illinois
  # 06=California
  # 36=New York
  mask = census_data['fips'].between(fips_code*1000, fips_code*1000 + 999)
  census_data_train = census_data[~mask]
  census_data_test = census_data[mask]


  XTrain = census_data_train[feature_cols]
  yTrain = census_data_train['Democrat']
  XTest = census_data_test[feature_cols]
  yTest = census_data_test['Democrat']

  return XTrain, yTrain, XTest, yTest

STATE_FIPS_CODE = 48
X_train, y_train, X_test, y_test = county_data(census_data, STATE_FIPS_CODE)

#print(X_train.head(2))
#print(y_test.head(2))


# Training a Decision Tree

We will use the scikit-learn library 

* Build the tree on the training data
* Predict the vote for the test counties
* Compute the error and the confusion matrix
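
The evaluation step can be sketched with `sklearn.metrics`, here on invented labels rather than on the real predictions. Because only about 15 % of the counties are Democrat (see the prior computed earlier), a classifier that almost always answers Republican still reaches a high plain accuracy; balanced accuracy, the average of the per-class recalls, makes that weakness visible.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels: 1 = Democrat, 0 = Republican
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

acc = accuracy_score(y_true, y_hat)
error = 1.0 - acc
bal_acc = balanced_accuracy_score(y_true, y_hat)
print(f"accuracy = {acc:.2f}, error = {error:.2f}, balanced accuracy = {bal_acc:.2f}")
```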

Vary some of the parameters (maximum depth, purity, criterion...) and visualize their influence.
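
One possible way to run such a sweep is sketched below. It uses synthetic, unbalanced data from `make_classification` as a stand-in for the county features, so the numbers it prints say nothing about the real dataset; the grid of depths and criteria is an arbitrary choice for illustration.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, unbalanced two-class data standing in for the county features
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (1, 3, 5, None):  # None lets the tree grow fully
        clf = tree.DecisionTreeClassifier(criterion=criterion,
                                          max_depth=max_depth,
                                          random_state=0)
        clf.fit(X_tr, y_tr)
        # Store (training accuracy, test accuracy) for this setting
        results[(criterion, max_depth)] = (clf.score(X_tr, y_tr),
                                           clf.score(X_te, y_te))

for (criterion, max_depth), (train_acc, test_acc) in results.items():
    print(f"{criterion:8s} depth={max_depth!s:5s} "
          f"train={train_acc:.3f} test={test_acc:.3f}")
```

A growing gap between training and test accuracy as the depth increases is the usual sign of overfitting. Purity-related settings such as `min_impurity_decrease` or `min_samples_leaf` can be swept the same way.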


## Reference
Info: https://scikit-learn.org/stable/modules/tree.html

Info: https://scikit-learn.org/stable/modules/model_evaluation.html


In [36]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

The following instructions display the tree.
Interpret the contents of the representation.

In [None]:
import graphviz

# Export the first levels of the fitted tree to DOT format, with readable labels
dot_data = tree.export_graphviz(clf, out_file=None, 
                     max_depth = 2,
                     feature_names=list(X_train.columns),  
                     class_names=["R","D"],  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph
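
To help with the interpretation: in the graphviz rendering each node shows the test applied to an attribute, the impurity, the number of training samples reaching the node, their count per class, and the majority class; the left branch is taken when the test is true. A plain-text view of the same structure can be obtained with `tree.export_text`, sketched here on synthetic data with hypothetical feature names.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic data; the feature names are hypothetical placeholders
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
names = ["feat_a", "feat_b", "feat_c", "feat_d"]

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per node: the test applied, then the class predicted at each leaf
report = tree.export_text(clf, feature_names=names)
print(report)
```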

In [None]:
clf.predict(X_test)

In [39]:
y_pred = clf.predict(X_test)


def meanPred() -> float:
  # fraction of test counties whose predicted class matches the true class
  return float(np.mean(y_pred == np.asarray(y_test)))


print(f"Accuracy: {meanPred()*100:.2f} %")

Accuracy: 89.76 %



---

# Bagging

The goal of this part is to build a bagging approach **by hand**.

The principle of the approach is to:

* Train and collect several trees on random samples of the training data
* Aggregate the predictions by vote 
* Evaluate the aggregated predictions
* Compare with the individual trees and with the previous result
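
The four steps above can be sketched as follows on synthetic data. Note that this sketch draws bootstrap samples (with replacement), which is the classical form of bagging and a swapped-in alternative to sub-sampling with `train_test_split`; the data, the number of trees and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the county features
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # an odd number avoids ties in the vote
forest = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set, with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    forest.append(tree.DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# One row of predictions per tree, then a majority vote per test sample
all_preds = np.array([clf.predict(X_te) for clf in forest])
vote = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = (all_preds == y_te).mean(axis=1)  # accuracy of each tree
bagged_acc = (vote == y_te).mean()
print(f"mean single-tree accuracy = {single_acc.mean():.3f}")
print(f"bagged accuracy           = {bagged_acc:.3f}")
```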


Use scikit-learn's training/test set construction function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to generate the sampled subsets.

**After the class, compare** with the scikit-learn [ensemble](https://scikit-learn.org/stable/modules/ensemble.html) functions
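
For that comparison, a minimal sketch with two ready-made estimators from `sklearn.ensemble` might look like this. The data are synthetic and the settings arbitrary; `BaggingClassifier` is left with its default base estimator, which is a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the county features
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=25, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # accuracy on the held-out part
    print(f"{name:14s} accuracy = {scores[name]:.3f}")
```

A random forest adds one ingredient to bagging: each split considers only a random subset of the attributes, which further decorrelates the trees.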

## Reference
Numpy tips: [np.arange](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html), [numpy.sum](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.sum.html), [numpy.mean](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.mean.html), [numpy.where](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.where.html)



In [43]:
from sklearn.model_selection import train_test_split
from sklearn import tree

# Training data: X_train, y_train
# Test data: X_test, y_test
# Building the predictor (training) takes the following steps:
#   - Build a random sub-sample of the training data
#   - Fit one tree on that sub-sample
#   - Add the tree to the forest
#
# At test time, every tree predicts and the predictions are aggregated by vote

def dataSplit(X, y, stateNumber: int = 42, splitRatio: float = 0.8):
  # splitRatio is passed as test_size: the first part (X_trn, y_trn) keeps a
  # fraction 1 - splitRatio of the rows, the second part (X_tst, y_tst) the rest
  X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, 
                                                random_state=stateNumber,
                                                stratify=y,
                                                test_size=splitRatio)

  return X_trn, X_tst, y_trn, y_tst 


def learn_forest(XTrain, yTrain, nb_trees, depth=15):
  forest = []
  singleperf = []

  for ss in range(nb_trees):
    # a different seed per tree gives a different random sub-sample
    X_sub, X_out, y_sub, y_out = dataSplit(XTrain, yTrain, stateNumber=ss)

    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_sub, y_sub)

    # accuracy of this single tree on the rows it was not trained on
    singleperf.append(clf.score(X_out, y_out))
    # the forest holds fitted classifiers only
    forest.append(clf)

  return forest, singleperf



In [41]:
def predict_forest(forest, XTest, yTest = None):
  
  all_preds = []
  singleperf = []

  for clf in forest:
    y_pred = clf.predict(XTest)
    all_preds.append(y_pred)

    if (yTest is not None):
      # accuracy of this single tree on the test set
      singleperf.append(np.mean(y_pred == np.asarray(yTest)))

  # majority vote: predict 1 (Democrat) when more than half of the trees do
  final_pred = (np.array(all_preds).mean(axis=0) > 0.5).astype(int)

  if (yTest is not None):
    return final_pred, singleperf
  else:
    return final_pred


In [44]:
#########################
## PUT YOUR CODE HERE
#########################

import sklearn.metrics as perf

X_train, y_train, X_test, y_test = county_data(census_data, 6)

F, singleperf = learn_forest(X_train, y_train, 20, depth=15)
pred, singleperftest = predict_forest(F, X_test, y_test)
acc = perf.balanced_accuracy_score( y_test, pred )
print("Balanced accuracy of the forest = {:.2f}%".format(100*acc))
print("Mean accuracy of the single trees = {:.2f}%".format(100*np.mean(singleperftest)))
#print(singleperftest)
#print(singleperf)

# Reference
Some useful code for external use:

In [None]:
# ==================
# DATA VISUALIZATION
# ==================

def printData():
  print("census_data:");         print(census_data.shape);          print("\n")
  print("census_data.columns:"); print(census_data.columns.values); print("\n")
  print("census_data['fips']");  print(census_data['fips']);        print("\n")
  print("census_data.head(n)");  print(census_data.head(200));      print("\n")

  return


def meanAttribute(column: int):
  attributeName = census_data.columns[column]             # columns start at 0
  attributeMean = np.array(census_data[attributeName]).mean()
  print(f"Mean of {attributeName} is {attributeMean:.4f}")

  return


def dataSplit():
  from sklearn.model_selection import train_test_split

  X = census_data.loc[:, 'votes':'voter_turnout_rate']
  y = census_data['Democrat']

  X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                      random_state=42,
                                                      stratify=y,
                                                      test_size=0.8)

  print(f"X : {X.shape}\ny : {y.shape}")
  print(f"X_train : {X_train.shape}\ny_train : {y_train.shape}")
  print(f"X_test : {X_test.shape}\ny_test : {y_test.shape}")
  print(f"lines : {X_train.shape[0] + X_test.shape[0]}")
  print("\n")
  # dimension = 1 when empty

  return X_train, X_test, y_train, y_test 