# Competition

# Task Overview
You are given a dataset of the top news headlines for each day and must predict the movement of the market value (1 for up, 0 for down).

Download the data from [competition page](https://www.kaggle.com/t/664260fab9b04f699426b48a29ff7d05). This is also where you will upload your submissions.

You need to improve the accuracy of the model as much as you can.

## Rules:
1. Do not use any external data **NOR** models pre-trained on other datasets
2. Use the test set **ONLY** to get predictions from your model. For example, do not use it to compute statistics or features (e.g. do not fit preprocessing steps on it).
3. To keep the competition fair, do not use deep learning models
4. Don't cheat :)

## Hints
Here are several techniques that you can use:

1. **Tune your hyper-parameters** Try the `GridSearchCV` class from `sklearn.model_selection` to find the best set of hyperparameters.
2. **Feature engineering** Play with the representation of the textual data. We only tried one, but there are more (e.g. TF-IDF Vectorizer is another powerful method to transform text to a vector, taking into account the rareness of the words across the texts). Also do not hesitate to play with the arguments of the *Vectorizers*. 
3. **Change your model** You are not restricted to train `LogisticRegression` only. You can use whatever algorithm you're already familiar with. Moreover, you can use the algorithms that you get to know during these 3 weeks of solving this assignment. E.g. give *RandomForests* a try!
4. **Use date** You can also use the date as an extra feature. Think about how you can use it and look for patterns!
5. **Combine multiple models** You can train multiple models and use their individual predictions to produce a final, improved prediction.
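
As a hedged illustration of hint 4, simple calendar features can be derived from a `Date` value with the standard library alone. The helper name `date_features` and the choice of features are assumptions made for this sketch; whether any of them actually helps has to be checked on a validation set.

```python
from datetime import date

def date_features(date_str: str) -> dict:
    """Turn an ISO date string (YYYY-MM-DD) into simple calendar features."""
    d = date.fromisoformat(date_str)
    return {
        "weekday": d.weekday(),              # Monday = 0 ... Sunday = 6
        "month": d.month,
        "day_of_month": d.day,
        "is_monday": int(d.weekday() == 0),  # 1 if the date is a Monday
    }

features = date_features("2008-08-08")
print(features)
```

The resulting dictionaries could be computed for each value of the `Date` column and appended to the text features as extra columns.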

## Scoring rules [16 points + 20 bonus points]
You have until **22.11.2023** to submit your tuned solutions.  
**You also need to submit the code for your best solution before the deadline.**

### **Part of the Assignment grade: [16 points]**
You need to beat two thresholds in order to get a full set of points for the assignment:

- You get **4 points** if you get at least 55% on the public board (`Super Easy Baseline`).

- You get another **12 points** if you beat the **easy baseline** - 58%. (We also added two hard baselines just for a point of reference)

### **Bonus points [up to 20 points]**
- **Top-5** on the final leaderboard get **20 bonus points**

- **Top-10** on the final leaderboard get **15 bonus points**

- **Top-15** on the final leaderboard get **10 bonus points**

- **Top-25** on the final leaderboard get **5 bonus points**

In [78]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [79]:

train.head(3) # look at the training data

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."


In [80]:
test.head(3) # and the test

Unnamed: 0,ID,Date,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,0,2015-01-02,Most cases of cancer are the result of sheer b...,Iran dismissed United States efforts to fight ...,Poll: One in 8 Germans would join anti-Muslim ...,UK royal family's Prince Andrew named in US la...,Some 40 asylum-seekers refused to leave the bu...,Pakistani boat blows self up after India navy ...,Sweden hit by third mosque arson attack in a week,940 cars set alight during French New Year,...,Ukrainian minister threatens TV channel with c...,Palestinian President Mahmoud Abbas has entere...,Israeli security center publishes names of 50 ...,The year 2014 was the deadliest year yet in Sy...,A Secret underground complex built by the Nazi...,Restrictions on Web Freedom a Major Global Iss...,Austrian journalist Erich Mchel delivered a pr...,Thousands of Ukraine nationalists march in Kiev,Chinas New Years Resolution: No More Harvestin...,Authorities Pull Plug on Russia's Last Politic...
1,1,2015-01-05,Moscow-&gt;Beijing high speed train will reduc...,Two ancient tombs were discovered in Egypt on ...,China complains to Pyongyang after N Korean so...,Scotland Headed Towards Being Fossil Fuel-Free...,Prime Minister Shinzo Abe said Monday he will ...,Sex slave at centre of Prince Andrew scandal f...,Gay relative of Hamas founder faces deportatio...,The number of female drug addicts in Iran has ...,...,The Islamic State has approved a 2015 budget o...,"Iceland To Withdraw EU Application, Lift Capit...",Blackfield Capital Founder Goes Missing: The v...,Rocket stage crashes back to Earth in rural Ch...,2 Dead as Aircraft Bombs Greek Tanker in Libya...,Belgian murderer Frank Van Den Bleeken to die ...,Czech President criticizes Ukrainian PM; says ...,3 Vietnamese jets join search for 16 missing F...,France seeks end to Russia sanctions over Ukraine,China scraps rare earths caps
2,2,2015-01-06,US oil falls below $50 a barrel,"Toyota gives away 5,680 fuel cell patents to b...",Young Indian couple who had been granted polic...,A senior figure in Islamic States self-declare...,Fukushima rice passes radiation tests for 1st ...,Nearly all Spanish parties guilty of financial...,King Abdullah to abdicate Saudi Throne,Taliban Commander Caught Networking On LinkedIn,...,Thousands of Indians have fled from their home...,Turkey sacks judges who oversaw Erdogan corrup...,SpaceX Falcon 9 launch and recovery has been a...,CNN: Americans charged in botched Gambia coup,Islamic State 'Police' Official Beheaded.,Libya bans Palestinians from country to preven...,A judicial inquiry was opened in France on Mon...,Video has captured the moment a cameraman was ...,Syria has complained to the United Nations tha...,"Tests over, India set to make the iris of bigg..."


In [81]:
# naively concatenating all the news
X_train = [' '.join(str(x) for x in train.iloc[row,2:27]) for row in range(len(train.index))]
X_test = [' '.join(str(x) for x in test.iloc[row,2:27]) for row in range(len(test.index))]

We create a local validation set, since the test labels are unknown and the number of Kaggle submissions per day is limited.

In [82]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, train.Label, test_size=0.2, random_state=42)

The data must be transformed into a format that standard classifiers can work with.

Each text needs a classifier-friendly representation, for example a bag of words.

Using *CountVectorizer* from *sklearn.feature_extraction.text* we can transform the *news* to a data matrix *X* of shape [num_days, vocabulary_size], where each row represents a single text (one day of news) and each column corresponds to one word of the vocabulary. Each entry is the number of occurrences of that word in that text.
Notice that the Vectorizer has a lot of useful arguments. These could potentially influence the performance of the models.
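
To make the shape of that matrix concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix on two made-up texts. It is only an illustration of the idea, not how *CountVectorizer* is implemented: the real class uses its own tokenizer (which, by default, lowercases and ignores punctuation and single-character tokens), whereas this sketch simply splits on whitespace.

```python
from collections import Counter

# Two made-up "days" of news
texts = ["oil prices fall", "oil prices rise as oil demand grows"]

# Vocabulary: every distinct lowercase token, sorted (one column per word)
tokenized = [t.lower().split() for t in texts]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Count matrix: one row per text, entry [i][j] = occurrences of word j in text i
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
```

The matrix has one row per text and one column per vocabulary word, which matches the [num_days, vocabulary_size] shape described above.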

In [83]:
# use a simple 1-gram encoder to encode texts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)

In [84]:
# simple logistic regression and using it on transformed test cases
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(vectorizer.transform(X_test))

To get a sense of how our simple classifier performs, we evaluate it on the validation set.

In [85]:
print((model.predict(X_val) == y_val).mean() * 100.0)

56.03715170278638


In [86]:
# creating a submission file for kaggle
pd.DataFrame({'ID': np.arange(len(preds)), 'Label': preds}).to_csv('submission_1gram.csv', index=False)

## Improvement of model 

#### Data preprocessing

- lowercasing / cleaning 

In [87]:
train.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


In [88]:
print("Missing values per column:")
print(train.isnull().sum())

Missing values per column:
Date     0
Label    0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
dtype: int64


#### Using c-TF-IDF (class-based TF-IDF)

In [14]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


class CTFIDFVectorizer(TfidfTransformer):
    """Class-based TF-IDF weighting built on TfidfTransformer.

    This is not a transformer (neural) model or a pre-trained model; see the TfidfTransformer documentation for usage.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def fit(self, X: sp.csr_matrix, n_samples: int):
        """Learn the IDF vector (global term weights)."""
        _, n_features = X.shape
        # Total count of each term over all rows (not the document frequency)
        df = np.squeeze(np.asarray(X.sum(axis=0)))
        idf = np.log(n_samples / df)
        self._idf_diag = sp.diags(idf, offsets=0,
                                  shape=(n_features, n_features),
                                  format='csr',
                                  dtype=np.float64)
        return self

    def transform(self, X: sp.csr_matrix) -> sp.csr_matrix:
        """Transform a count matrix to c-TF-IDF (class-based TF-IDF)."""
        # Scale every column by its IDF weight, then L1-normalize each row
        X = X * self._idf_diag
        X = normalize(X, axis=1, norm='l1', copy=False)
        return X
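
Read directly from the `fit` and `transform` methods above, the weighting can be written as the formula below. The symbols are our own notation, not taken from the notebook: $c_{d,t}$ is the count of term $t$ in row $d$ of the count matrix, $f_t$ is the total count of term $t$ over all rows, and $n$ is the `n_samples` argument.

$$
f_t = \sum_{d} c_{d,t}, \qquad
\mathrm{idf}_t = \log\frac{n}{f_t}, \qquad
w_{d,t} = \frac{c_{d,t}\,\mathrm{idf}_t}{\sum_{t'} \left| c_{d,t'}\,\mathrm{idf}_{t'} \right|}
$$

Because $f_t$ is a total count rather than a document frequency, it can exceed $n$ for frequent words, in which case $\mathrm{idf}_t$ is negative. The L1 normalization divides by the sum of absolute values, so the sign of such weights is preserved.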

In [15]:
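import re

def clean_text(text):
    # Strip the b'...' / b"..." wrapper left over from the byte-string repr
    if text.startswith("b'") or text.startswith('b"'):
        text = text[2:-1]
    # Resolve escape sequences such as \' that remain in the raw string
    text = bytes(text, 'utf-8').decode('unicode_escape')
    # Keep only ASCII letters and whitespace. The flags must be passed by keyword:
    # the fourth positional argument of re.sub is count, not flags.
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)
    text = text.lower()
    return text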
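
A note on `re.sub`: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag constant passed as the fourth positional argument is interpreted as `count`. The sketch below uses a made-up string of 300 digits to show what happens when the value of `re.I | re.A` ends up in the `count` slot, compared with passing it as `flags`.

```python
import re

flag_value = int(re.I | re.A)   # flags are plain integers: 2 | 256
digits = "1" * 300

# The flag value used as count: only that many replacements are made
limited = re.sub(r"\d", "", digits, count=flag_value)

# The flags passed by keyword: every match is replaced
complete = re.sub(r"\d", "", digits, flags=re.I | re.A)

print(flag_value, len(limited), len(complete))
```

On short headlines the limit is rarely reached, which is why this kind of mistake is easy to miss.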

In [16]:
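for col in train.columns:
    if col.startswith('Top'):
        # Replace missing headlines with '' first; astype(str) alone would turn NaN into the literal text 'nan'
        train[col] = train[col].fillna('').astype(str).apply(clean_text)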


#### Feature Engineering
- Vectorizers / features

In [73]:
# Join the 25 headline columns of each day into a single text
top_cols = [f'Top{i}' for i in range(1, 26)]
train['combined_text'] = train[top_cols].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
train['combined_text'].head()

In [74]:
# The binary labels are in the 'Label' column
X = train['combined_text']
y = train['Label']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
ctfidf = CTFIDFVectorizer()
X_train_ctfidf = ctfidf.fit_transform(X_train_counts, n_samples=X_train.shape[0])


In [75]:
from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression()

# Train the model
model.fit(X_train_ctfidf, y_train)

In [69]:
from sklearn.metrics import classification_report, confusion_matrix

# Transform the test set
X_test_counts = vectorizer.transform(X_test)
X_test_ctfidf = ctfidf.transform(X_test_counts)

In [70]:
y_pred = model.predict(X_test_ctfidf)

In [76]:
# X_val still holds raw strings, so pass it through the fitted vectorizer and c-TF-IDF weighting before predicting
X_val_ctfidf = ctfidf.transform(vectorizer.transform(X_val))
print((model.predict(X_val_ctfidf) == y_val).mean() * 100.0)

In [65]:
# Evaluate on the validation split: the test labels are unknown, so the report
# must compare y_val with predictions made for X_val (not for the test set)
y_val_pred = model.predict(ctfidf.transform(vectorizer.transform(X_val)))
print(classification_report(y_val, y_val_pred))
print(confusion_matrix(y_val, y_val_pred))

In [None]:

# Predict the test set with the same count -> c-TF-IDF pipeline the model was trained on
preds = model.predict(ctfidf.transform(vectorizer.transform(X_test)))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd


def process_column(column_name):
    # TF-IDF Vectorization
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(train[column_name])

    # Clustering
    # n_init and random_state are set explicitly to avoid the n_init FutureWarning and make runs repeatable
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # Adjust the number of clusters as needed
    kmeans.fit(X)
    labels = kmeans.labels_

    # Dimensionality Reduction
    # PCA needs a dense ndarray: toarray() returns one, while todense() returns an unsupported np.matrix
    pca = PCA(n_components=2)
    reduced_X = pca.fit_transform(X.toarray())

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return reduced_X, labels, vectorizer.get_feature_names_out()

# Example usage for 'Top1'
reduced_X_top1, labels_top1, feature_names_top1 = process_column('Top1')

# Visualization for 'Top1'
plt.figure(figsize=(10, 6))
plt.scatter(reduced_X_top1[:, 0], reduced_X_top1[:, 1], c=labels_top1)
plt.title('Cluster Visualization for Top1')
plt.show()

#### Model Selection 
- Use `GridSearchCV` to find the best hyperparameters for tree-based, logistic regression, and SVM models
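
A minimal sketch of such a search, assuming a `TfidfVectorizer` plus `LogisticRegression` pipeline. The toy texts, labels, and parameter grid are made up for illustration; in the notebook they would be replaced by the combined headlines, `train['Label']`, and a grid of your choice. Placing the vectorizer inside the `Pipeline` means it is refit on the training part of every cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data (made up for this sketch)
texts = ["stocks rally on strong earnings", "markets fall amid war fears",
         "oil prices surge higher", "bank collapse sparks panic",
         "tech shares climb again", "recession fears hit markets",
         "strong jobs report lifts stocks", "crisis deepens as banks fail"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy data is tiny; use more folds on the real data
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Other estimators, for example a decision tree or an SVM, can be tried by swapping the `clf` step and adjusting the grid keys accordingly.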