# Competition

# Task Overview
You are given a dataset of the top news for each day, and the goal is to predict the movement of the market value (1 for up, 0 for down).

Download the data from the [competition page](https://www.kaggle.com/t/664260fab9b04f699426b48a29ff7d05). This is also where you will upload your submissions.

You need to improve the accuracy of the model as much as you can.

## Rules:
1. Do not use any external data **NOR** models pre-trained on other datasets
2. Use the test set **ONLY** to get predictions from your model. For example, do not use it to compute statistics or features (e.g. do not fit any preprocessing step on it).
3. To keep the competition fair, do not use deep learning models
4. Don't cheat :)

## Hints
Here are several techniques that you can use:

1. **Tune your hyper-parameters** Try the `GridSearchCV` class from `sklearn.model_selection` to find the best set of hyperparameters.
2. **Feature engineering** Play with the representation of the textual data. We only tried one, but there are more (e.g. the TF-IDF Vectorizer is another powerful method to transform text to a vector, taking into account the rareness of the words across the texts). Also do not hesitate to play with the arguments of the *Vectorizers*.
3. **Change your model** You are not restricted to training `LogisticRegression` only. You can use whatever algorithm you're already familiar with. Moreover, you can use the algorithms that you get to know during the 3 weeks of solving this assignment. For example, give *RandomForests* a try!
4. **Use date** You can also use the date as an extra feature. Think about how you can use it and look for patterns!
5. **Combine multiple models** You can train multiple models and use their individual predictions to produce a final, improved prediction.
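
As a rough illustration of hints 1 and 2, the sketch below wraps a `TfidfVectorizer` and a `LogisticRegression` in a `Pipeline` and tunes both with `GridSearchCV`. The toy `texts` and `labels`, the parameter grid and `cv=2` are made-up assumptions chosen only so the example runs on its own; on the real data you would pass the concatenated news strings and `train.Label`, and pick a larger grid and more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the concatenated daily news and their labels (made up).
texts = [
    "stocks rally as markets surge",
    "markets surge on strong earnings",
    "investors cheer strong rally",
    "earnings beat lifts stocks",
    "stocks plunge amid crisis fears",
    "crisis fears hit markets",
    "investors flee as stocks plunge",
    "weak earnings drag markets down",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is re-fitted on the
# training part of every CV fold, so no held-out text leaks into the vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: parameters are addressed as <step name>__<parameter>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)  # best combination found on the toy data
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

Because `GridSearchCV` refits the best pipeline on all the data it was given by default, `search.predict(...)` can then be called directly on raw text.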

## Scoring rules [16 points + 20 bonus points]
You have until **22.11.2023** to submit your tuned solutions.  
**You also need to submit the code for your best solution before the deadline.**

### **Part of the Assignment grade: [16 points]**
You need to beat two thresholds in order to get a full set of points for the assignment:

- You get **4 points** if you get at least 55% on the public board (`Super Easy Baseline`).

- You get another **12 points** if you beat the **easy baseline** (58%). We also added two hard baselines as a point of reference.

### **Bonus points [up to 20 points]**
- **Top-5** on the final leaderboard get **20 bonus points**

- **Top-10** on the final leaderboard get **15 bonus points**

- **Top-15** on the final leaderboard get **10 bonus points**

- **Top-25** on the final leaderboard get **5 bonus points**

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [None]:
train.head(3) # look at the training data

In [None]:
test.head(3) # and the test

In [None]:
# naively concatenating all the news
X_train = [' '.join(str(x) for x in train.iloc[row,2:27]) for row in range(len(train.index))]
X_test = [' '.join(str(x) for x in test.iloc[row,2:27]) for row in range(len(test.index))]

We create a local validation set, since we don't know the test labels and the number of submissions to Kaggle per day is limited.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, train.Label, test_size=0.2, random_state=42)

The data must be transformed into a format that standard classifiers can work with.

We need to represent each text in a classifier-friendly form, for example as a bag of words.

Using *CountVectorizer* from *sklearn.feature_extraction.text* we can transform the *news* into a data matrix *X* of shape [num_days, vocabulary_size], where each row represents a single text and each column corresponds to a specific word of the vocabulary. An entry of *X* is the number of occurrences of that word in that text.
Notice that the Vectorizer has a lot of useful arguments. These could potentially influence the performance of the models.

In [None]:
# use a simple 1-gram encoder to encode texts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)

In [None]:
# fit a simple logistic regression and predict on the transformed test texts
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(vectorizer.transform(X_test))

To get a sense of how well our simple classifier performs, we evaluate it on the validation set.

In [None]:
print((model.predict(X_val) == y_val).mean() * 100.0)
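
The number printed above is the accuracy in percent. As a small sketch on made-up labels (`y_true` and `y_pred` below are illustrative, not the competition data), the same value can be obtained with `accuracy_score`, and it can be compared against a classifier that always predicts the most frequent class. With thresholds of 55% and 58%, such a reference point helps judge whether a model has learned anything.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up validation labels and predictions, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Same computation as in the cell above, and its sklearn equivalent.
manual_acc = (y_pred == y_true).mean() * 100.0
sklearn_acc = accuracy_score(y_true, y_pred) * 100.0

# Majority-class baseline: ignores the features and always predicts
# the most frequent label it saw during fit.
dummy_features = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(dummy_features, y_true)
baseline_acc = baseline.score(dummy_features, y_true) * 100.0

print(manual_acc, sklearn_acc)  # 75.0 75.0
print(baseline_acc)             # 62.5
```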

In [None]:
# creating a submission file for kaggle
pd.DataFrame({'ID': np.arange(len(preds)), 'Label': preds}).to_csv('submission_1gram.csv', index=False)