# Text Classification

In dieser Aufgabe werden Sie eine Text Classification Pipeline bauen, die Partei gegeben einen Text vorhersagt. 
Statt der Parlamentsdebatten koennen Sie auch gerne einen anderen Text Datensatz nehmen, wenn Sie einen guten finden.
Stellen Sie aber sicher, dass Ihre Kollegen Zugriff auf die Daten haben fuer die Korrektur. 

In [1]:
import os, gzip
import pandas as pd
import numpy as np
import urllib.request

import warnings
warnings.filterwarnings('ignore')

DATADIR = "data"

if not os.path.exists(DATADIR): 
    os.mkdir(DATADIR)

file_name = os.path.join(DATADIR, 'bundestags_parlamentsprotokolle.csv.gzip')
if not os.path.exists(file_name):
    url_data = 'https://www.dropbox.com/s/1nlbfehnrwwa2zj/bundestags_parlamentsprotokolle.csv.gzip?dl=1'
    urllib.request.urlretrieve(url_data, file_name)

df = pd.read_csv(gzip.open(file_name), index_col=0).sample(frac=1)
# labels
parteien = df.partei.unique()
# total data length
# df.shape[0]

Ein Auszug der Parlamentsdebatten

In [2]:
df[:4]

Unnamed: 0,sitzung,wahlperiode,sprecher,text,partei
17300,195,17,Dr. Frank Steffel,Und sie gilt international als Auslaufmodell. ...,cducsu
5400,71,17,Dirk Becker,Mit dem vorliegenden Antrag der SPD beschäftig...,spd
39586,186,18,Dagmar G. Wöhrl,Auch im Libanon werden die Spannungen angesich...,cducsu
36096,144,18,Norbert Müller,Wir haben ein ganzes Bündel von Maßnahmen und ...,linke


Splitten Sie die Daten in Train (80%) und Test (20%), dafür koennen sie die sklearn train_test_split function benutzen. 

Dann trainieren Sie eine Pipeline mit einem geeigneten Vectorizer und einem sklearn Modell Ihrer Wahl. 

Vergleichen Sie die Precision/Recall/F1 und Accuracy auf dem Train und Test set. 

In [3]:
def accuracyPerClass (t_labels,predictions, labels, print_res=0):
    # calculate accuracy result for each label in prediction result
    # t_labals: train of test labels: prediction results
    # predictions
    cmat = confusion_matrix(t_labels, predictions)
    accuracy_arr = np.array(cmat.diagonal()/cmat.sum(axis=1))
    if print_res==1:
        #print(cmat)
        #print(accuracy_arr)
        print('accuracy: ', end=' ')
        for idx,label in enumerate(labels):
            print(label, np.around(accuracy_arr[idx],decimals=3), end=' ')

    return accuracy_arr

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report
# import for accuracy
from sklearn.metrics import accuracy_score

# example types

# 1.1.1. Ordinary Least Squares
# 1.6.5. Nearest Centroid Classifier
#from sklearn.neighbors import NearestCentroid
# 1.1.12. Stochastic Gradient Descent - SGD
#from sklearn.linear_model import SGDClassifier

# new exercise test classifiers
# error
# from sklearn.linear_model import LinearRegression
#
# logistic model
from sklearn.linear_model import LogisticRegression
# 
from sklearn.svm import SVC

In [5]:
# Put some data aside for model evaluation
train_data, test_data, train_labels, test_labels = train_test_split(df['text'], df['partei'], test_size=0.2)

# error!
#linreg_clf = Pipeline([('vect', TfidfVectorizer(max_features=int(1e8))),
#                            ('clf', LinearRegression())]).fit(train_data, train_labels)

logreg_clf = Pipeline([('vect', TfidfVectorizer(max_features=int(1e8))),
                        ('clf', LogisticRegression())]).fit(train_data, train_labels)

# error not feasible!
#svc_clf = Pipeline([('vect', TfidfVectorizer(max_features=int(1e8))),
#                        ('clf', SVC())]).fit(train_data, train_labels)

## Evaluating Logistic Regression on Training Data

In [6]:
logreg_predictions = logreg_clf.predict(train_data)
print(classification_report(logreg_predictions, train_labels))
print('accuracy: ',np.around(accuracy_score(train_labels,logreg_predictions),decimals=3))
accuracyPerClass (train_labels,logreg_predictions, parteien, 1);

              precision    recall  f1-score   support

      cducsu       0.94      0.70      0.80     17325
         fdp       0.16      0.97      0.28       459
      gruene       0.61      0.86      0.72      3583
       linke       0.77      0.86      0.81      4398
         spd       0.73      0.75      0.74      9178

   micro avg       0.75      0.75      0.75     34943
   macro avg       0.64      0.83      0.67     34943
weighted avg       0.82      0.75      0.77     34943

accuracy:  0.752
accuracy:  cducsu 0.943 spd 0.163 linke 0.615 gruene 0.772 fdp 0.726 

## Evaluating Logistic Regression on Test Data

In [7]:
logreg_predictions = logreg_clf.predict(test_data)
print(classification_report(logreg_predictions, test_labels))
print('accuracy: ',np.around(accuracy_score(test_labels,logreg_predictions),decimals=3))
accuracyPerClass (test_labels,logreg_predictions, parteien, 1);

              precision    recall  f1-score   support

      cducsu       0.87      0.60      0.71      4653
         fdp       0.06      0.83      0.11        48
      gruene       0.35      0.59      0.44       733
       linke       0.58      0.68      0.63      1045
         spd       0.52      0.55      0.53      2257

   micro avg       0.60      0.60      0.60      8736
   macro avg       0.48      0.65      0.48      8736
weighted avg       0.70      0.60      0.63      8736

accuracy:  0.598
accuracy:  cducsu 0.87 spd 0.061 linke 0.346 gruene 0.581 fdp 0.519 

## Evaluating SVC on Training Data

In [8]:
# error result not feasible
#svc_predictions = svc_clf.predict(train_data)
#print(classification_report(svc_predictions, train_labels))
#accuracyPerClass (train_labels,svc_predictions, parteien, 1);

#               precision    recall  f1-score   support

#       cducsu       1.00      0.37      0.54     34943
#          fdp       0.00      0.00      0.00         0
#       gruene       0.00      0.00      0.00         0
#        linke       0.00      0.00      0.00         0
#          spd       0.00      0.00      0.00         0

#    micro avg       0.37      0.37      0.37     34943
#    macro avg       0.20      0.07      0.11     34943
# weighted avg       1.00      0.37      0.54     34943

# accuracy:  cducsu 1.0 spd 0.0 fdp 0.0 linke 0.0 gruene 0.0 

## Evaluating SVC on Test Data

In [9]:
# error result not feasible
#svc_predictions = svc_clf.predict(test_data)
#print(classification_report(svc_predictions, test_labels))
#accuracyPerClass (test_labels,svc_predictions, parteien, 1);

In [10]:
# tests and comments how to use confusion matrix

#import pdb; pdb.set_trace()

# not working
#for partei in parteien:
#    print(partei)
#    accuracy_score(train_labels[\'spd'],logreg_predictions[\'spd'])
    
# accuracy_score(train_labels['spd'],logreg_predictions['spd'])    

# how to use confusion matrix
# from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
# y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
# y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
# confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
# array([[2, 0, 0],
#        [0, 0, 1],
#        [1, 0, 2]])
# print('diagonal, label correct predicted')
# print('1st col for predicted ants: lower left corner, ant was detected at cat')

# cmat = confusion_matrix(train_labels, logreg_predictions, labels=['cducsu'])
# cmat = confusion_matrix(train_labels, logreg_predictions, labels=['cducsu', 'gruene', 'spd', 'linke', 'fdp'])

In [11]:
# from sklearn.metrics import f1_score
# f1_score?
