## import necessary libraries

Import `pandas` as `pd`, `numpy` as `np` and `seaborn` as `sns`.
Furthermore, import `Pipeline` from `sklearn.pipeline`, `RandomForestClassifier` from `sklearn.ensemble` and `train_test_split` from `sklearn.model_selection`.

Import `accuracy_score`, `f1_score`, `precision_score`, `hamming_loss` and `confusion_matrix` from `sklearn.metrics`. 

Finally, import `CountVectorizer` and `TfidfTransformer` from `sklearn.feature_extraction.text`, `matplotlib.pyplot` as `plt`, and `pplot_cm` from `conf_matrix` (this script should be in your local repo).


In [None]:
import ...

## load data

Load your previously saved CSV file into a DataFrame using pandas' `read_csv()`.

In [None]:
data = ...

In [None]:
data
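As a minimal sketch of what `read_csv()` returns, the example below reads a small in-memory CSV instead of a file. The column names `text` and `sentiment` and the three rows are made up for illustration; your own file may look different.

```python
import io

import pandas as pd

# Hypothetical stand-in for the saved file; in the notebook you would pass
# the path of your own CSV to read_csv() instead of this in-memory buffer.
csv_text = (
    "text,sentiment\n"
    "Das Essen war hervorragend,positive\n"
    "Der Service war furchtbar,negative\n"
    "Es war in Ordnung,neutral\n"
)

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)             # (number of rows, number of columns)
print(data.columns.tolist())  # column names taken from the header row
```

If your file uses a different separator or encoding, `read_csv()` accepts `sep` and `encoding` arguments for that.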

## plot label frequencies

Set the seaborn color palette to "deep" using `sns.set()`.
Then, plot the label frequencies using `sns.countplot()` on the column "sentiment" (or whatever you have called it).

In [None]:
sns.set(...)
...
plt.show()
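One possible way to do this is sketched below on a made-up DataFrame. The toy labels are assumptions, and the `Agg` backend is only selected so the sketch runs without a display; in the notebook you would leave that line out and call `plt.show()` as in the cell above.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch only; omit in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical toy labels; use your own DataFrame here.
data = pd.DataFrame(
    {"sentiment": ["positive", "negative", "positive", "neutral"]}
)

# Apply the "deep" palette, then draw one bar per distinct label.
sns.set(palette="deep")
ax = sns.countplot(x="sentiment", data=data)

# The same frequencies as plain numbers, useful for spotting class imbalance.
counts = data["sentiment"].value_counts()
print(counts)
```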

## load stopwords

Stop words are very frequent words that carry little meaning on their own, so removing them lets the model focus on more informative tokens. We have provided you with a list of common German stop words in `data/stopwords_german.txt`. Import the packages `io` and `unidecode` first, then use `io.open()` and `readlines()` to read the words contained in the .txt file into a list.

Call the Python string method `strip()` to remove newline characters (`\n`) and unidecode's `unidecode()` on every element in the resulting list.

In [None]:
# you can also add your own stopwords in this step using append()
...

stopwords = ...

In [None]:
stopwords
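The steps above could look roughly like the sketch below. It writes a tiny temporary file as a stand-in for `data/stopwords_german.txt` (the three words and the appended `"bzw"` are made up), and it falls back to a standard-library accent stripper if `unidecode` is not installed. That fallback is an assumption added only to keep the sketch runnable and is not equivalent to `unidecode()` for every character.

```python
import io
import os
import tempfile
import unicodedata

try:
    from unidecode import unidecode
except ImportError:
    # Fallback for this sketch only: strip combining accents with the
    # standard library. Unlike unidecode, it leaves e.g. "ß" unchanged.
    def unidecode(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for the provided file: one word per line.
    path = os.path.join(tmp_dir, "stopwords_german.txt")
    with io.open(path, "w", encoding="utf-8") as f:
        f.write("für\nüber\nkönnen\n")

    # Read every line of the file into a list of strings.
    with io.open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()

# Remove the trailing newline and transliterate to plain ASCII.
stopwords = [unidecode(line.strip()) for line in lines]

# Add a custom stop word of your own.
stopwords.append("bzw")
print(stopwords)
```

Transliterating the stop words matters because the vectorizer in the pipeline strips accents from the texts as well, so both sides need to use the same spelling.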

## split data for training

To train and evaluate the model, we split the data into a training set and a test set using `train_test_split()`. Its arguments are the text column, the label/sentiment column, a test set size (`test_size=0.1` for 10%, `test_size=0.3` for 30%, etc.) and an integer of your choice as `random_state`, which makes the split reproducible.

You can then call `.shape` on the resulting sets to see their dimensions.

In [None]:
X_train, X_test, y_train, y_test = ...
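A minimal sketch of the split on made-up data follows. The column names, the eight toy rows, `test_size=0.25` and `random_state=42` are all illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a text column and a label column.
data = pd.DataFrame({
    "text": [f"Beispieltext {i}" for i in range(8)],
    "sentiment": ["positive", "negative"] * 4,
})

# 25% of the rows go to the test set; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["sentiment"], test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With imbalanced labels, passing the label column as `stratify` keeps the class proportions similar in both sets.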

## set up ML pipeline

Instantiate a pipeline by adding 3 steps: a `CountVectorizer()` `'vect'`, a `TfidfTransformer()` `'tfidf'` and a `RandomForestClassifier()` `'rf'`.

The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) turns text into numerical values by counting the tokens each text contains. Pass `analyzer='word'`, `strip_accents='unicode'` and `lowercase=True`. Pass your list of stopwords as `stop_words`.

The arguments for the `TfidfTransformer` are `use_idf=True` and `smooth_idf=True`.

Fit your pipeline to the training data by calling `fit()` on the pipeline object and passing the training texts and training labels.

In [None]:
pipeline = Pipeline([
    (
        'vect',
        ...
    ),
    (
        'tfidf',
        ...
    ),
    (
        'rf',
        ...
    )
])


In [None]:
# fit pipeline to training data
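For orientation, here is one possible way to assemble and fit such a pipeline on a handful of made-up sentences. The toy texts, labels and stop words are assumptions, as are `n_estimators=10` and `random_state=42`, which only keep the sketch small and reproducible; the vectorizer and transformer arguments are the ones listed above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for your stop word list and training split.
stopwords = ["der", "die", "das", "war", "ist"]
train_texts = [
    "Das Essen war hervorragend",
    "Der Service war furchtbar",
    "Die Bedienung ist sehr freundlich",
    "Das Zimmer war schmutzig und laut",
]
train_labels = ["positive", "negative", "positive", "negative"]

pipeline = Pipeline([
    # Step 1: raw text -> token counts.
    ("vect", CountVectorizer(
        analyzer="word",
        strip_accents="unicode",
        lowercase=True,
        stop_words=stopwords,
    )),
    # Step 2: token counts -> tf-idf weights.
    ("tfidf", TfidfTransformer(use_idf=True, smooth_idf=True)),
    # Step 3: tf-idf features -> class prediction.
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# fit() runs the steps in order on the training texts and labels.
pipeline.fit(train_texts, train_labels)

# The fitted pipeline accepts raw strings directly.
predictions = pipeline.predict(["Das Essen war gut", "Der Film war schlecht"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned from the training texts only, which avoids leaking information from the test set.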

## score model

We have provided you with a function to score your model using the test texts and labels. In case of encoding issues, calling `.values.astype('U')` on the texts before passing them to your pipeline might help.

In [None]:
def score_model(true, pred):
    print('Accuracy:', accuracy_score(true, pred))
    print('F1:', f1_score(true, pred, average='weighted'))
    print('Precision:', precision_score(true, pred, average='weighted'))
    print('Hamming loss:', hamming_loss(true, pred))
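To get a feeling for two of these metrics, the sketch below evaluates them on four made-up labels, one of which is predicted incorrectly. For single-label predictions like these, the Hamming loss is the fraction of misclassified samples, so it equals one minus the accuracy.

```python
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical true and predicted labels; the third prediction is wrong.
true = ["positive", "negative", "positive", "negative"]
pred = ["positive", "negative", "negative", "negative"]

accuracy = accuracy_score(true, pred)  # share of correct predictions
loss = hamming_loss(true, pred)        # share of wrong predictions

print("Accuracy:", accuracy)
print("Hamming loss:", loss)
```

In the notebook, `true` would be your test labels and `pred` the output of `pipeline.predict()` on the test texts.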

## plot confusion matrix

To quickly plot a confusion matrix, use the provided function `pplot_cm()` and pass the same arguments as with `score_model()`.

In [None]:
# plot a confusion matrix to visualize true positives, true negatives, ...
# https://en.wikipedia.org/wiki/Confusion_matrix
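The internals of `pplot_cm` are not shown in this notebook, so the sketch below uses scikit-learn's `confusion_matrix()` (imported in the first step) to show the kind of numbers such a plot visualizes. The labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels; the third prediction is wrong.
true = ["positive", "negative", "positive", "negative"]
pred = ["positive", "negative", "negative", "negative"]

# Fix the label order so rows and columns are easy to read:
# rows are the true classes, columns are the predicted classes.
labels = ["negative", "positive"]
cm = confusion_matrix(true, pred, labels=labels)

print(cm)
```

The diagonal holds the correctly classified samples per class; every off-diagonal cell counts one specific kind of mix-up.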

## manual tests

Pass the example texts from the repo description to `pipeline.predict()` and play around with new texts to get a feeling for how your model determines a sentiment.

In [1]:
# use your pipeline to create class predictions for the three example texts given in the readme