## Model Selection
This chapter runs a series of experiments to select the best model.

### Test 1: Baseline
The first test is a baseline case with no data preprocessing. The bag-of-words (BoW) model is used for feature extraction. Logistic regression, support vector machines, and the naive Bayes model are evaluated in this test. The baseline uses unigrams only.
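
To make the unigram bag-of-words representation concrete, the following is a minimal standard-library sketch, not the notebook's actual pipeline. The helper names `tokenize` and `bag_of_words` and the two toy tweets are assumptions made for illustration; the token pattern mirrors the default documented for `CountVectorizer` (lowercasing, tokens of at least two word characters).

```python
import re
from collections import Counter

# Token pattern equivalent to CountVectorizer's documented default:
# sequences of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into unigram tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(docs):
    """Return (sorted vocabulary, list of per-document count vectors)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Hypothetical toy tweets, not taken from the dataset.
docs = ["I love this phone", "I hate this phone, hate it"]
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```

Each column corresponds to one vocabulary term and each row to one tweet; word order is discarded, which is why the baseline cannot distinguish phrases that differ only in the arrangement of the same words.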

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
plt.style.use('fivethirtyeight')

cols = ['sentiment', 'id', 'date','query_string', 'user', 'text']
df2 = pd.read_csv("training.1600000.processed.noemoticon.csv",encoding = "ISO-8859-1",header=None, names=cols)
df2['sentiment'] = df2['sentiment'].map({0: 0, 4: 1})
# Treat empty tweets as missing values (assignment instead of chained inplace)
df2['text'] = df2['text'].replace('', np.nan)

x = df2.text
y = df2.sentiment

SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

In [15]:
# To evaluate other methods, assign each of them to 'classifier' in turn
cvec = CountVectorizer()
classifier = LinearSVC()

# Numeric transformation with the BoW model
cvec.fit(x_train)
x_train = cvec.transform(x_train)
x_test = cvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(cvec.get_feature_names()))



LinearSVC()
Accuracy: 78.84%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7907    0.7878    0.7892     40237
           1     0.7861    0.7889    0.7875     39763

    accuracy                         0.7884     80000
   macro avg     0.7884    0.7884    0.7884     80000
weighted avg     0.7884    0.7884    0.7884     80000

Anzahl von Features:  661318


In [16]:
x = df2.text
y = df2.sentiment

SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)
cvec = CountVectorizer()
classifier = MultinomialNB()

# Numeric transformation with the BoW model
cvec.fit(x_train)
x_train = cvec.transform(x_train)
x_test = cvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(cvec.get_feature_names()))

MultinomialNB()
Accuracy: 78.20%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7644    0.8189    0.7907     40237
           1     0.8025    0.7446    0.7725     39763

    accuracy                         0.7820     80000
   macro avg     0.7835    0.7818    0.7816     80000
weighted avg     0.7834    0.7820    0.7817     80000

Anzahl von Features:  661318
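
The F1 scores in the report above can be checked by hand, since F1 is the harmonic mean of precision and recall. The sketch below recomputes them from the precision and recall values printed for `MultinomialNB`; the function name `f1_score_from` is an assumption for illustration, and small deviations in the last digit are expected because the inputs are already rounded to four decimals.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the MultinomialNB report above
# (class 0 = negative, class 1 = positive).
f1_negative = f1_score_from(0.7644, 0.8189)
f1_positive = f1_score_from(0.8025, 0.7446)

# The macro average weights both classes equally.
macro_f1 = (f1_negative + f1_positive) / 2

print(round(f1_negative, 4), round(f1_positive, 4), round(macro_f1, 4))
```

Because the two classes are almost perfectly balanced in this test set, the macro and weighted averages in the report are nearly identical.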


### Test 2.1: Preprocessing & BoW
In this test, the data are preprocessed before feature extraction with the BoW model.

In [5]:
# Again, several methods are evaluated
csv = 'clean_tweet_sw.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
x = my_df.text
y = my_df.sentiment

SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

In [6]:
cvec = CountVectorizer()
classifier = LogisticRegression()
cvec.fit(x_train)
x_train= cvec.transform(x_train)
x_test = cvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(cvec.get_feature_names()))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
Accuracy: 77.69%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7909    0.7531    0.7715     39806
           1     0.7642    0.8008    0.7820     39782

    accuracy                         0.7769     79588
   macro avg     0.7775    0.7769    0.7768     79588
weighted avg     0.7775    0.7769    0.7768     79588

Anzahl von Features:  262511


In [22]:
csv = 'clean_tweet_sw.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)

SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)
classifier = LinearSVC()
cvec = CountVectorizer()
# Numeric transformation with the BoW model
cvec.fit(x_train)
x_train= cvec.transform(x_train)
x_test = cvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(cvec.get_feature_names()))



LinearSVC()
Accuracy: 77.03%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7834    0.7472    0.7649     39806
           1     0.7583    0.7933    0.7754     39782

    accuracy                         0.7703     79588
   macro avg     0.7708    0.7703    0.7701     79588
weighted avg     0.7708    0.7703    0.7701     79588

Anzahl von Features:  262511


In [23]:
csv = 'clean_tweet_sw.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)

SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)
cvec = CountVectorizer()
classifier = MultinomialNB()

# Numeric transformation with the BoW model
cvec.fit(x_train)
x_train= cvec.transform(x_train)
x_test = cvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(cvec.get_feature_names()))

MultinomialNB()
Accuracy: 76.74%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7633    0.7755    0.7693     39806
           1     0.7717    0.7594    0.7655     39782

    accuracy                         0.7674     79588
   macro avg     0.7675    0.7674    0.7674     79588
weighted avg     0.7675    0.7674    0.7674     79588

Anzahl von Features:  262511


### Test 2.2: Preprocessing & TF-IDF
This test examines the effect of using TF-IDF instead of the BoW model.
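
As a rough illustration of how TF-IDF weights differ from raw counts, here is a small standard-library sketch. It assumes the smoothed formula documented for scikit-learn's default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The function name `tfidf`, the whitespace tokenisation, and the toy documents are simplifications made for this example only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalisation (illustrative)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy documents, not taken from the dataset.
docs = ["good phone", "bad phone"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "good" receives a higher weight than "phone" because "phone" occurs in every document and is therefore less discriminative.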

In [19]:
csv = 'clean_tweet_sw.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
x = my_df.text
y = my_df.sentiment

SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

In [20]:
tvec = TfidfVectorizer()
classifier = LogisticRegression()
tvec.fit(x_train)
x_train= tvec.transform(x_train)
x_test = tvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(tvec.get_feature_names()))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
Accuracy: 77.92%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7902    0.7604    0.7750     39806
           1     0.7690    0.7980    0.7832     39782

    accuracy                         0.7792     79588
   macro avg     0.7796    0.7792    0.7791     79588
weighted avg     0.7796    0.7792    0.7791     79588

Anzahl von Features:  262511


In [24]:
csv = 'clean_tweet_sw.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
x = my_df.text
y = my_df.sentiment

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)
tvec = TfidfVectorizer()
classifier = LinearSVC()
tvec.fit(x_train)
x_train= tvec.transform(x_train)
x_test = tvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(tvec.get_feature_names()))

LinearSVC()
Accuracy: 77.24%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7820    0.7554    0.7685     39806
           1     0.7633    0.7894    0.7761     39782

    accuracy                         0.7724     79588
   macro avg     0.7727    0.7724    0.7723     79588
weighted avg     0.7727    0.7724    0.7723     79588

Anzahl von Features:  262511


In [25]:
csv = 'clean_tweet_sw.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
x = my_df.text
y = my_df.sentiment

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

tvec = TfidfVectorizer()
classifier = MultinomialNB()
tvec.fit(x_train)
x_train= tvec.transform(x_train)
x_test = tvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(tvec.get_feature_names()))

MultinomialNB()
Accuracy: 76.27%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7587    0.7706    0.7646     39806
           1     0.7668    0.7547    0.7607     39782

    accuracy                         0.7627     79588
   macro avg     0.7628    0.7627    0.7627     79588
weighted avg     0.7628    0.7627    0.7627     79588

Anzahl von Features:  262511


### Test 3: Preprocessing and TF-IDF (with Stop Words)
Compared with Test 1 (baseline), the results in Tests 2.1 and 2.2 get worse once the data are preprocessed. This test therefore examines whether stop words affect classification performance. The preprocessing step saved two different files; this test uses 'clean_tweet.csv'.
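
One possible explanation for such an effect, stated here as an assumption rather than something measured in this chapter, is that common stop-word lists contain negation words. The sketch below uses a deliberately tiny, hypothetical stop-word list and hypothetical sentences to show how two tweets with opposite sentiment can collapse to the same text once stop words are removed.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only;
# the list used during preprocessing is not shown in this chapter.
STOP_WORDS = {"i", "am", "not", "the", "this", "is", "with"}

def remove_stop_words(text):
    """Drop every token that appears in STOP_WORDS."""
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)

positive = "I am happy with this phone"
negative = "I am not happy with this phone"

print(remove_stop_words(positive))
print(remove_stop_words(negative))
```

If both tweets end up with identical features, no classifier can separate them, regardless of the algorithm used.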

In [26]:
# Load the cleaned data and split it
SEED = 26105111
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)

x = my_df.text
y = my_df.sentiment
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

tvec = TfidfVectorizer()
classifier = LogisticRegression()
tvec.fit(x_train)
x_train= tvec.transform(x_train)
x_test = tvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(tvec.get_feature_names()))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
Accuracy: 80.00%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.8077    0.7881    0.7978     39951
           1     0.7927    0.8120    0.8023     39877

    accuracy                         0.8000     79828
   macro avg     0.8002    0.8001    0.8000     79828
weighted avg     0.8002    0.8000    0.8000     79828

Anzahl von Features:  262636


In [27]:
SEED = 26105111
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)

x = my_df.text
y = my_df.sentiment
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

tvec = TfidfVectorizer()
classifier = LinearSVC()
tvec.fit(x_train)
x_train= tvec.transform(x_train)
x_test = tvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(tvec.get_feature_names()))

LinearSVC()
Accuracy: 79.35%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.8004    0.7824    0.7913     39951
           1     0.7868    0.8045    0.7956     39877

    accuracy                         0.7935     79828
   macro avg     0.7936    0.7935    0.7934     79828
weighted avg     0.7936    0.7935    0.7934     79828

Anzahl von Features:  262636


In [28]:
SEED = 26105111
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)

x = my_df.text
y = my_df.sentiment
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

tvec = TfidfVectorizer()
classifier = MultinomialNB()
tvec.fit(x_train)
x_train= tvec.transform(x_train)
x_test = tvec.transform(x_test)
sentiment_fit = classifier.fit(x_train, y_train)
y_pred = sentiment_fit.predict(x_test)
print(classifier)
print("Accuracy: {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("-"*80)
print("Evaluationsmaße\n")
print(classification_report(y_test, y_pred, digits=4))
print('Anzahl von Features: ', len(tvec.get_feature_names()))

MultinomialNB()
Accuracy: 77.27%
--------------------------------------------------------------------------------
Evaluationsmaße

              precision    recall  f1-score   support

           0     0.7640    0.7898    0.7767     39951
           1     0.7821    0.7556    0.7686     39877

    accuracy                         0.7727     79828
   macro avg     0.7730    0.7727    0.7727     79828
weighted avg     0.7730    0.7727    0.7727     79828

Anzahl von Features:  262636


### Test 4: Lexical Approach (TextBlob)
This test evaluates how well a lexicon-based method performs on the given data, using TextBlob.

In [13]:
x = my_df.text
y = my_df.sentiment
from sklearn.model_selection import train_test_split
SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

from textblob import TextBlob
tbresult = [TextBlob(i).sentiment.polarity for i in x_test] 
tbpred = [0 if n < 0 else 1 for n in tbresult]
conmat = np.array(confusion_matrix(y_test, tbpred, labels= [1,0]))
confusion = pd.DataFrame(conmat, index=['positive', 'negative'], columns=
['predicted_positive','predicted_negative'])
print("Accuracy Score: {0:.2f}%".format(accuracy_score(y_test, tbpred)*100))
print("-"*80)
print("Confusion Matrix\n")
print(confusion)
print("-"*80)
print("Classification Report\n")
print(classification_report(y_test, tbpred, digits=4))

Accuracy Score: 61.06%
--------------------------------------------------------------------------------
Confusion Matrix

          predicted_positive  predicted_negative
positive               35863                4029
negative               27055               12881
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7617    0.3225    0.4532     39936
           1     0.5700    0.8990    0.6977     39892

    accuracy                         0.6106     79828
   macro avg     0.6659    0.6108    0.5754     79828
weighted avg     0.6659    0.6106    0.5754     79828
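
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, and recall from the four counts printed above; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix above
# (rows: actual class, columns: predicted class).
tp, fn = 35863, 4029    # actual positive
fp, tn = 27055, 12881   # actual negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision_pos = tp / (tp + fp)   # of all predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of all actual positives, how many are found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy:           {accuracy:.4f}")
print(f"precision (class 1): {precision_pos:.4f}")
print(f"recall    (class 1): {recall_pos:.4f}")
print(f"precision (class 0): {precision_neg:.4f}")
print(f"recall    (class 0): {recall_neg:.4f}")
```

The low recall for class 0 shows that most negative tweets are labelled positive. This is consistent with the thresholding rule in the cell above, which maps every polarity of 0 or greater, including neutral scores, to the positive class.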



### Test 5: Comparison of Features and N-Grams
The final test examines how n-grams and the number of features affect the classifiers. In addition to unigrams, bigrams and trigrams are considered. For each setting, the classifiers (LR, SVM, and NB) are trained with a varying number of features (between 10,000 and 100,000).
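
The following standard-library sketch illustrates what an n-gram range means. The function `ngrams` and the example sentence are assumptions for illustration; the `ngram_range` tuple follows the convention used in this test, where `(1, 2)` yields unigrams plus bigrams and `(1, 3)` additionally yields trigrams.

```python
def ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical example sentence, not taken from the dataset.
tokens = "not a good day".split()
print(ngrams(tokens, (1, 1)))
print(ngrams(tokens, (1, 2)))
print(ngrams(tokens, (1, 3)))
```

Longer n-grams such as "not a good" preserve local word order that unigrams lose, at the cost of a much larger vocabulary. This is why the number of features is limited in the runs of this test.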

In [29]:
# Load the cleaned data and split it
# Note: stop words have not been removed from this dataset (same file as in Test 3)
SEED = 26105111
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv)
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)

x = my_df.text
y = my_df.sentiment
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.05, random_state=SEED)

- N-gram: unigram, bigram, or trigram
- Number of features: 10,000 to 100,000
- Algorithms: logistic regression, SVM, NB
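
Limiting the number of features means keeping only the most frequent terms of the corpus. The sketch below shows the idea with the standard library; `top_k_vocabulary`, the alphabetical tie-break, and the toy documents are assumptions for this example and may differ in detail from how `TfidfVectorizer(max_features=...)` resolves ties.

```python
from collections import Counter

def top_k_vocabulary(docs, max_features):
    """Keep the max_features most frequent terms across the whole corpus."""
    counts = Counter(t for d in docs for t in d.lower().split())
    # Sort by descending frequency, then alphabetically for a stable tie-break
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(term for term, _ in ranked[:max_features])

# Feature budgets as used in this test: 10,000 to 100,000 in steps of 10,000,
# mirroring np.arange(10000, 100001, 10000).
n_features = list(range(10000, 100001, 10000))

# Hypothetical toy documents, not taken from the dataset.
docs = ["good day good mood", "bad day", "good phone"]
print(top_k_vocabulary(docs, max_features=2))
print(n_features)
```

Rare terms are dropped first, so a smaller budget mainly removes tokens that occur in only a few tweets.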

In [30]:
def accuracy_summary(classifier, vectorizer, x_train, y_train, x_test, y_test):
    # Transform the raw text with the already fitted vectorizer
    x_train = vectorizer.transform(x_train)
    x_test = vectorizer.transform(x_test)
    sentiment_fit = classifier.fit(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: {0:.2f}%".format(accuracy*100))
    print("-"*80)
    print("Classification Report\n")
    print(classification_report(y_test, y_pred, digits=4))
    # Return the accuracy so that callers can collect it per feature count
    return accuracy

### 5.1 Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression

n_features = np.arange(10000,100001,10000)
tvec = TfidfVectorizer()
lr = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)

def nfeature_accuracy_checker(vectorizer=tvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        # Limit the vocabulary size and set the n-gram range, then refit
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        vectorizer.fit(x_train)
        print("Ergebnisse bei {} Features".format(n))
        # Use the classifier and vectorizer passed in, not the globals
        nfeature_accuracy = accuracy_summary(classifier, vectorizer, x_train, y_train, x_test, y_test)
        result.append((n, nfeature_accuracy))
    return result

### Unigrams

In [34]:
feature_result_1gram = nfeature_accuracy_checker(vectorizer=tvec)

LogisticRegression(C=2, max_iter=1000, n_jobs=-1)


Ergebnisse bei 10000 Features
Accuracy: 79.52%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8041    0.7811    0.7924     39951
           1     0.7868    0.8094    0.7979     39877

    accuracy                         0.7952     79828
   macro avg     0.7954    0.7952    0.7952     79828
weighted avg     0.7954    0.7952    0.7952     79828

Ergebnisse bei 20000 Features
Accuracy: 79.82%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8064    0.7852    0.7957     39951
           1     0.7903    0.8112    0.8006     39877

    accuracy                         0.7982     79828
   macro avg     0.7984    0.7982    0.7981     79828
weighted avg     0.7984    0.7982    0.7981    

### Bigrams

In [35]:
feature_result_2gram = nfeature_accuracy_checker(vectorizer=tvec,ngram_range=(1, 2))

LogisticRegression(C=2, max_iter=1000, n_jobs=-1)


Ergebnisse bei 10000 Features
Accuracy: 80.56%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8133    0.7937    0.8034     39951
           1     0.7982    0.8174    0.8077     39877

    accuracy                         0.8056     79828
   macro avg     0.8058    0.8056    0.8056     79828
weighted avg     0.8058    0.8056    0.8056     79828

Ergebnisse bei 20000 Features
Accuracy: 81.35%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8213    0.8016    0.8114     39951
           1     0.8059    0.8253    0.8155     39877

    accuracy                         0.8135     79828
   macro avg     0.8136    0.8135    0.8134     79828
weighted avg     0.8136    0.8135    0.8134    

### Trigrams

In [36]:
feature_result_3gram = nfeature_accuracy_checker(vectorizer=tvec,ngram_range=(1, 3))

LogisticRegression(C=2, max_iter=1000, n_jobs=-1)


Ergebnisse bei 10000 Features
Accuracy: 80.52%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8134    0.7925    0.8028     39951
           1     0.7973    0.8178    0.8075     39877

    accuracy                         0.8052     79828
   macro avg     0.8054    0.8052    0.8051     79828
weighted avg     0.8054    0.8052    0.8051     79828

Ergebnisse bei 20000 Features
Accuracy: 81.33%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8221    0.8002    0.8110     39951
           1     0.8050    0.8265    0.8156     39877

    accuracy                         0.8133     79828
   macro avg     0.8136    0.8133    0.8133     79828
weighted avg     0.8136    0.8133    0.8133    

### 5.2 Support Vector Machine

In [37]:
from sklearn.svm import LinearSVC
n_features = np.arange(10000,100001,10000)
tvec = TfidfVectorizer()
svm = LinearSVC()

In [38]:
def nfeature_accuracy_checker(vectorizer=tvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=svm):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        # Limit the vocabulary size and set the n-gram range, then refit
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        vectorizer.fit(x_train)
        print("Ergebnisse bei {} Features".format(n))
        # Use the classifier and vectorizer passed in, not the globals
        nfeature_accuracy = accuracy_summary(classifier, vectorizer, x_train, y_train, x_test, y_test)
        result.append((n, nfeature_accuracy))
    return result

### Unigrams

In [39]:
feature_result_1gram = nfeature_accuracy_checker(vectorizer=tvec, classifier=svm)

LinearSVC()


Ergebnisse bei 10000 Features
Accuracy: 79.50%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8057    0.7780    0.7916     39951
           1     0.7850    0.8120    0.7983     39877

    accuracy                         0.7950     79828
   macro avg     0.7953    0.7950    0.7949     79828
weighted avg     0.7953    0.7950    0.7949     79828

Ergebnisse bei 20000 Features
Accuracy: 79.71%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8070    0.7816    0.7941     39951
           1     0.7879    0.8127    0.8001     39877

    accuracy                         0.7971     79828
   macro avg     0.7974    0.7971    0.7971     79828
weighted avg     0.7974    0.7971    0.7971     79828

Ergebnisse bei 30000 Features


### Bigrams

In [40]:
feature_result_2gram = nfeature_accuracy_checker(vectorizer=tvec,ngram_range=(1, 2))

LinearSVC()


Ergebnisse bei 10000 Features
Accuracy: 80.48%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8148    0.7894    0.8019     39951
           1     0.7954    0.8203    0.8076     39877

    accuracy                         0.8048     79828
   macro avg     0.8051    0.8048    0.8048     79828
weighted avg     0.8051    0.8048    0.8048     79828

Ergebnisse bei 20000 Features
Accuracy: 81.29%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8233    0.7973    0.8101     39951
           1     0.8032    0.8285    0.8156     39877

    accuracy                         0.8129     79828
   macro avg     0.8132    0.8129    0.8129     79828
weighted avg     0.8132    0.8129    0.8129     79828

Ergebnisse bei 30000 Features


### Trigrams

In [41]:
feature_result_3gram = nfeature_accuracy_checker(vectorizer=tvec,ngram_range=(1, 3))

LinearSVC()


Ergebnisse bei 10000 Features
Accuracy: 80.37%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8138    0.7881    0.8008     39951
           1     0.7942    0.8194    0.8066     39877

    accuracy                         0.8037     79828
   macro avg     0.8040    0.8038    0.8037     79828
weighted avg     0.8041    0.8037    0.8037     79828

Ergebnisse bei 20000 Features
Accuracy: 81.21%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.8230    0.7956    0.8091     39951
           1     0.8018    0.8286    0.8150     39877

    accuracy                         0.8121     79828
   macro avg     0.8124    0.8121    0.8120     79828
weighted avg     0.8124    0.8121    0.8120     79828

Ergebnisse bei 30000 Features


### 5.3 Naive Bayes Model

In [43]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
multiNB = MultinomialNB()

def nfeature_accuracy_checker(vectorizer=tvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=multiNB):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        # Limit the vocabulary size and set the n-gram range, then refit
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        vectorizer.fit(x_train)
        print("Ergebnisse bei {} Features".format(n))
        # Use the classifier and vectorizer passed in, not the globals
        nfeature_accuracy = accuracy_summary(classifier, vectorizer, x_train, y_train, x_test, y_test)
        result.append((n, nfeature_accuracy))
    return result

### Unigrams

In [44]:
feature_result_1gram = nfeature_accuracy_checker(vectorizer=tvec)

MultinomialNB()


Ergebnisse bei 10000 Features
Accuracy: 77.10%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7697    0.7741    0.7719     39951
           1     0.7724    0.7679    0.7701     39877

    accuracy                         0.7710     79828
   macro avg     0.7710    0.7710    0.7710     79828
weighted avg     0.7710    0.7710    0.7710     79828

Ergebnisse bei 20000 Features
Accuracy: 77.37%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7717    0.7779    0.7748     39951
           1     0.7757    0.7694    0.7725     39877

    accuracy                         0.7737     79828
   macro avg     0.7737    0.7737    0.7737     79828
weighted avg     0.7737    0.7737    0.7737     79828

Ergebnisse bei 30000 Featu

### Bigrams

In [45]:
feature_result_2gram = nfeature_accuracy_checker(vectorizer=tvec,ngram_range=(1, 2))

MultinomialNB()


Ergebnisse bei 10000 Features
Accuracy: 78.23%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7794    0.7879    0.7836     39951
           1     0.7852    0.7766    0.7809     39877

    accuracy                         0.7823     79828
   macro avg     0.7823    0.7823    0.7822     79828
weighted avg     0.7823    0.7823    0.7822     79828

Ergebnisse bei 20000 Features
Accuracy: 78.94%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7864    0.7951    0.7907     39951
           1     0.7924    0.7836    0.7880     39877

    accuracy                         0.7894     79828
   macro avg     0.7894    0.7894    0.7894     79828
weighted avg     0.7894    0.7894    0.7894     79828

Ergebnisse bei 30000 Featu

### Trigrams

In [46]:
feature_result_3gram = nfeature_accuracy_checker(vectorizer=tvec,ngram_range=(1, 3))

MultinomialNB()


Ergebnisse bei 10000 Features
Accuracy: 78.01%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7776    0.7850    0.7813     39951
           1     0.7826    0.7751    0.7788     39877

    accuracy                         0.7801     79828
   macro avg     0.7801    0.7801    0.7801     79828
weighted avg     0.7801    0.7801    0.7801     79828

Ergebnisse bei 20000 Features
Accuracy: 78.87%
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0     0.7863    0.7933    0.7898     39951
           1     0.7911    0.7840    0.7875     39877

    accuracy                         0.7887     79828
   macro avg     0.7887    0.7887    0.7887     79828
weighted avg     0.7887    0.7887    0.7887     79828

Ergebnisse bei 30000 Featu