
# Boring stuff: setting everything up

*Warning: run this section only once*

In [None]:
!pip install spacy-nightly --pre

In [None]:
!pip install -U pip setuptools wheel

In [None]:
!pip install -U spacy transformers

In [None]:
!git clone https://github.com/explosion/projects.git spacy-projects

In [None]:
!spacy project assets

# Sentiment analysis: Reddit Posts Dataset

You can take a look at the dataset [here](https://drive.google.com/file/d/118kEBuOXikDJhlAvDVmAVxNBymtQ5MKb/view?usp=sharing).

*Example records, tab-separated as [TEXT_CONTENT, EMOTION_ID, TEXT_ID]:*

*   My favourite food is anything I didn't have to cook myself.	27	eebbqej
*   Thank you friend	15	eeqd04y
*   It's crazy how far Photoshop has come. Underwater bridges?!! NEVER!!!	7,13	efanc6t


Check out **assets/categories.txt** to explore the labels for this dataset. *The first row corresponds to emotion_id 0, the second row to emotion_id 1, and so on. A record can carry several comma-separated IDs, as in the third example above (7,13).*
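As a minimal sketch of that row-to-ID mapping, the helper below turns an EMOTION_ID field into label names. `decode_emotions` and the three-item demo list are hypothetical and only illustrate the idea; in the notebook you would pass the rows of assets/categories.txt instead.

```python
def decode_emotions(emotion_ids, categories):
    """Map a comma-separated EMOTION_ID field to label names.

    `categories` is the list of labels in file order, so that
    categories[0] is the label for emotion_id 0, and so on.
    """
    return [categories[int(i)] for i in emotion_ids.split(",")]


# Hypothetical stand-in for the rows of assets/categories.txt.
demo_categories = ["label_a", "label_b", "label_c"]
print(decode_emotions("0,2", demo_categories))
```

With the real file, `categories` could be built with `open("assets/categories.txt").read().splitlines()`.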

---



## ***Edit [project.yml](/content/drive/MyDrive/NLP_MASTER/finance/project.yml) and change gpu_id from -1 to 0 to take advantage of the Colab GPU***

In [None]:
!spacy project run preprocess

In [None]:
!spacy project run train

In [None]:
!spacy project run evaluate

In [None]:
import spacy
nlp = spacy.load("./training/cnn/model-best")

texts = [
    "It was really bad to watch you leave, hopefully you'll be back soon",
    "Oh yes, I can relate to that. Still, you'd better think about it twice.",
]

for doc in nlp.pipe(texts):
    # doc.cats maps each label to the score predicted by the classifier
    print(doc.cats)

# Data Preparation: from the Reddit Posts Dataset to the Financial News Dataset

**TODO: Upload the Financial News Dataset file FinancialPhraseBank_AllAgree.txt to the assets folder. You can find the dataset [here](https://drive.google.com/file/d/1WXM2t8sh-myIEUZt37zIXC2McNrCyS2l/view?usp=sharing).**

Financial news dataset example records [TEXT_CONTENT, SENTIMENT_LABEL]:


*   According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral
*   Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .@positive
*   Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .@negative



---

Now you have to **format the Financial News Dataset like the Reddit Posts Dataset**, in order to retrain the sentiment classifier on the new financial dataset.

Remember to split the dataset into train (70%), validation (10%) and test (20%), **saving the respective TSV files (train.tsv, dev.tsv, test.tsv) in the assets folder**.

---
## Hints:
- Our final dataset should have the following columns: text, label, id. Text and label are already in our file (in the same row!), while the ID must be generated so that it is unique (e.g. with `uuid.uuid4()`).
- Categories are represented as strings (neutral, positive, negative), while spaCy expects them as integers.
- Should we split the observations randomly or use some specific criteria?
- The train, val and test files should be stored as tab-separated files (sep="\t") under the assets/ folder, with the following names:
  - train.tsv
  - dev.tsv
  - test.tsv
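The hints above can be sketched as follows. This is one possible shape for the solution, not the reference one, and it rests on several assumptions: the label order matches the one later written to assets/categories.txt, the TSV files carry no header row (like the Reddit example records), the source file is latin-1 encoded, and a random split stratified by label is an acceptable answer to the "random or specific criteria" question. All function names are hypothetical.

```python
import random
import uuid
from collections import defaultdict

# Assumed label order; it must match the rows of assets/categories.txt.
LABELS = ["neutral", "positive", "negative"]


def parse_line(line):
    """Split 'some text .@label' into (text, label_id, unique_id)."""
    text, _, label = line.strip().rpartition("@")
    # Tabs inside the text would break the TSV layout, so flatten them.
    text = text.strip().replace("\t", " ")
    return text, LABELS.index(label), uuid.uuid4().hex


def stratified_split(rows, fractions=(0.7, 0.1), seed=42):
    """Shuffle within each label, then cut into train/dev/test.

    The test set receives whatever is left after train and dev (about 20%).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_dev = round(len(group) * fractions[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


def write_tsv(path, rows):
    """Write rows as text<TAB>label<TAB>id, one record per line, no header."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label_id, uid in rows:
            f.write(f"{text}\t{label_id}\t{uid}\n")


def prepare(src="assets/FinancialPhraseBank_AllAgree.txt", out_dir="assets"):
    """Read the raw file and write train.tsv, dev.tsv and test.tsv."""
    # latin-1 is an assumption about the file's encoding; adjust if needed.
    with open(src, encoding="latin-1") as f:
        rows = [parse_line(line) for line in f if line.strip()]
    for name, part in zip(("train", "dev", "test"), stratified_split(rows)):
        write_tsv(f"{out_dir}/{name}.tsv", part)
```

Calling `prepare()` once the dataset file is in the assets folder writes the three TSV files. Stratifying by label keeps the class proportions similar across the three splits, which matters if the classes are unbalanced.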

In [None]:
### TO DO

If you haven't done so already, check out the file assets/categories.txt: it contains the (many) labels for the sentiment classification of the Reddit Posts Dataset. Now you have to **change it to the labels of the Financial News Dataset (neutral, positive, negative)**.

In [None]:
#!echo -en "neutral\npositive\nnegative" > /content/drive/MyDrive/NLP_MASTER/finance/assets/categories.txt
!echo -en "neutral\npositive\nnegative" > ./assets/categories.txt

Let spaCy **preprocess our input files** again (assets/train.tsv, assets/dev.tsv, assets/test.tsv and assets/categories.txt) and convert them to the format it needs internally.

In [None]:
!spacy project run preprocess

spaCy is a bit picky about existing directories, so **delete the previous CNN model** you trained on the Reddit Posts Dataset.

In [None]:
#!rm -rf /content/drive/MyDrive/NLP_MASTER/finance/training/cnn
!rm -rf ./training/cnn

Everything is ready, **let's train the model** on the Financial News Dataset!

In [None]:
!spacy project run train

In [None]:
!spacy project run evaluate

# Running predictions on examples!

In [None]:
import spacy
nlp = spacy.load("./training/cnn/model-best")

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

for doc in nlp.pipe(texts):
    # doc.cats maps each label to the score predicted by the classifier
    print(doc.cats)
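`doc.cats` is a dictionary that maps each label to a score. One possible way to condense it into a single value, sketched below, is to take the top label or a signed score (positive minus negative). The helper names and the demo scores are made up for illustration, and the sketch assumes the dictionary keys are the label names neutral, positive and negative.

```python
def top_label(cats):
    """Return the label with the highest score in a doc.cats-style dict."""
    return max(cats, key=cats.get)


def signed_sentiment(cats):
    """Collapse the scores into one number (in [-1, 1] if scores lie in [0, 1])."""
    return cats.get("positive", 0.0) - cats.get("negative", 0.0)


# Made-up scores shaped like doc.cats, not real model output.
demo_cats = {"neutral": 0.25, "positive": 0.5, "negative": 0.25}
print(top_label(demo_cats), signed_sentiment(demo_cats))
```

In the notebook, `signed_sentiment(doc.cats)` could be called inside the `nlp.pipe` loop to obtain one number per text.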


# Final task: sentiment as a Prophet regressor

## Main goal
**The presentations will start at 11:45 a.m. on Friday**

Forecast the EUR-USD exchange rate using both the time series (e.g. previous values) and news downloaded from the [ForexRate news archive](http://www.forexrate.co.uk/newsarchive.php).
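One way to combine both sources, in the spirit of this section's title, is to average the news sentiment per day and pass it to Prophet as an extra regressor. The sketch below is an outline under stated assumptions, not a reference solution: `daily_mean_sentiment` and `fit_with_sentiment` are hypothetical helpers, days without news are filled with 0.0, and pandas and Prophet are imported inside the fitting function so that the aggregation step runs without them.

```python
from collections import defaultdict
from datetime import date


def daily_mean_sentiment(records):
    """Average (day, score) pairs into a {day: mean score} dict."""
    buckets = defaultdict(list)
    for day, score in records:
        buckets[day].append(score)
    return {day: sum(v) / len(v) for day, v in buckets.items()}


def fit_with_sentiment(rates, sentiment):
    """Fit Prophet on {day: rate} with the daily sentiment as a regressor."""
    import pandas as pd
    from prophet import Prophet  # requires `pip install prophet`

    days = sorted(rates)
    df = pd.DataFrame({
        "ds": pd.to_datetime(days),
        "y": [rates[d] for d in days],
        # Assumption: a day without news gets a neutral score of 0.0.
        "sentiment": [sentiment.get(d, 0.0) for d in days],
    })
    model = Prophet()
    model.add_regressor("sentiment")
    model.fit(df)
    return model


# Made-up scores for illustration only.
demo = [(date(2021, 5, 3), 0.5), (date(2021, 5, 3), -0.5), (date(2021, 5, 4), 0.25)]
print(daily_mean_sentiment(demo))
```

When predicting, the dataframe passed to `model.predict` must also contain a `sentiment` column, so the sentiment for the forecast days has to be known or estimated beforehand.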
## Dataset
You will use the dataset downloaded and used in previous labs.
As a test set, use the observations in the range [1st June 2021, 1st June 2022], endpoints included.
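A minimal sketch of that split, assuming the observations are available as (day, value) pairs with `datetime.date` days. `split_by_date` is a hypothetical helper and the demo values are made up.

```python
from datetime import date

TEST_START = date(2021, 6, 1)
TEST_END = date(2022, 6, 1)


def split_by_date(observations):
    """Split (day, value) pairs into (train, test); both test bounds are inclusive."""
    train = [o for o in observations if o[0] < TEST_START]
    test = [o for o in observations if TEST_START <= o[0] <= TEST_END]
    return train, test


# Made-up values for illustration only.
demo_obs = [
    (date(2021, 5, 31), 1.0),
    (date(2021, 6, 1), 2.0),
    (date(2022, 6, 1), 3.0),
    (date(2022, 6, 2), 4.0),
]
train, test = split_by_date(demo_obs)
print(len(train), len(test))
```

Observations after the end of the test range are dropped by this helper. Everything before the range is kept for training, so the model never sees the test period.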

In your presentation you should focus on the methodological approach you used for solving this problem **AND** the main insights to share with your business stakeholders.

## Metrics
You should use some of the metrics shown during the time-series lecture (or even better ones!) and motivate your choices. It will certainly be interesting to go beyond stating "the MAE is X.Y": are there any particular patterns? How does performance vary over time? Is it worth having a predictive model instead of a "baseline" approach?
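To make those questions concrete, the sketch below computes the MAE, a naive "previous value" baseline to compare against, and the MAE grouped by month. The function names are hypothetical and the rates are made up; the naive baseline is only one possible choice of reference.

```python
from collections import defaultdict
from datetime import date


def mae(actual, predicted):
    """Mean absolute error of two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(series):
    """Baseline: predict each value with the previous observation."""
    return series[:-1]


def mae_by_month(days, actual, predicted):
    """MAE grouped by (year, month), to see how the error moves over time."""
    errors = defaultdict(list)
    for d, a, p in zip(days, actual, predicted):
        errors[(d.year, d.month)].append(abs(a - p))
    return {month: sum(e) / len(e) for month, e in errors.items()}


# Made-up rates for illustration only.
rates = [1.0, 1.5, 1.25, 1.75]
baseline_mae = mae(rates[1:], naive_forecast(rates))
print(baseline_mae)
```

If the model's MAE on the test set is not clearly below `baseline_mae`, the extra complexity of the model is hard to justify to stakeholders.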
## Presentation format
Each team of 3-4 members will present its results to all of us in 15 minutes, using a brief PowerPoint presentation, and answer any questions (both from us and from other teams!).

## Organizational stuff
**The presentations will start at 11:45 a.m. on Friday**

Until then you can work together with your teammates: please don't work on it overnight!!!
We are more interested in the insights you will share with business stakeholders and in the statistical robustness of your methodological approach than in an infinitesimal improvement on your metrics of choice.

We will also be available tomorrow morning, from 9 a.m., on the dedicated call to answer your questions and help you solve technical issues.




In [None]:
import requests
from bs4 import BeautifulSoup

offset = 0
max_offset = 1649
offset_increment = 12

BASE_URL = 'http://www.forexrate.co.uk/'

In [None]:
news_archive = []

for start in range(0, max_offset, offset_increment):
  url = f'{BASE_URL}newsarchive.php?start={start}'
  print(url)
  page = requests.get(url, timeout=30)
  soup = BeautifulSoup(page.content, 'html.parser')
  tables = soup.find_all('table')
  news_table = tables[1]  # the code assumes the second table holds the news list
  rows = news_table.find_all(['th', 'tr'])

  for row in rows[1:]:  # skip the first (header) element
    for cell in row.find_all('td'):
      link = cell.find('a')
      if link is None or not link.get('href'):
        continue  # cell without a link: nothing to follow
      txt = cell.text
      href = BASE_URL + link['href'].replace('./', '')
      if "newsarchive.php?start=" in href:
        continue  # pagination link, not an article
      # fetch the article page to read its date
      date_page = requests.get(href, timeout=30)
      date_soup = BeautifulSoup(date_page.content, 'html.parser')
      date_div = date_soup.find_all('div')[3]  # the code assumes the fourth <div> holds the date
      date_str = date_div.text
      record = {'txt': txt, 'url': href, 'date': date_str}
      news_archive.append(record)
      print(len(news_archive), date_str, record)

In [None]:
import pandas as pd

df = pd.DataFrame(news_archive)

In [None]:
df.to_csv("./hist_fx.csv", index=False)

In [None]:
import pickle

with open('/content/drive/MyDrive/NLP_MASTER/news_archive.pkl', 'wb') as f:
  pickle.dump(news_archive, f)

with open('/content/drive/MyDrive/NLP_MASTER/news_archive.pkl', 'rb') as f:
  loaded_news_archive = pickle.load(f)

In [None]:
len(loaded_news_archive)