# Data Science and Machine Learning Pretest

*   This Colab is read-only. Please save a copy of it on your Drive to edit it by going to `Menu > File > Save a copy in Drive`.
*   Rename your Colab in the following format and replace 111423000 with your student ID:
> `Copy 111423000 of Data Science and Machine Learning pretest v23.02.ipynb`
*   You are required to complete this pretest **on your own**.
*   You MUST provide a list of referred resources at the end of your answers.
*   When you have completed the questions below, make sure you turn on the **share/edit/view permission**.


# Question 1: Text preprocessing
Most of the text data acquired through web crawling and review can be noisy. When handling this kind of text data, preprocessing is an important step to ensure the quality of the dataset. There are several procedures in text preprocessing, including:
1. lowercase
2. decontracting
3. remove tags, punctuations, numbers
4. tokenization
5. stopword removal
6. lemmatization
7. stemming




## 1-1. Please briefly explain what each step does and describe how you would apply them in a text classification task. Use examples/practical applications in your answers. (30%)


### **1. lowercase**

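Converts all letters to lowercase. This keeps text consistent across documents: the uppercase and lowercase forms of a word usually carry the same meaning, so folding case removes that source of variation and simplifies later text analysis.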


In [7]:
text = "I WANT PASSWORD CARD!!!"
lowercase_text = text.lower()
print(lowercase_text)

i want password card!!!


### **2. decontracting**

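Expands contractions into their full forms (for example, "can't" becomes "can not") so the model can process and understand the text more easily, which can improve model performance.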

In [8]:
import re

def decontracted(phrase):
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    return phrase

text = "i can't believe i don't have password card."

print(decontracted(text))

i can not believe i do not have password card.
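
The two rules above cover only "can't" and the generic "n't" ending; an irregular form such as "won't" would come out as "wo not". A minimal sketch of a slightly broader rule table follows. The rule list and the name `expand_contractions` are illustrative assumptions, not a complete or standard mapping, and the ambiguous "'s" ending is deliberately left out.

```python
import re

# Hypothetical, non-exhaustive rules. Irregular forms must come before
# the generic "n't" rule, otherwise "won't" would become "wo not".
CONTRACTION_RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(phrase):
    """Apply each rule in order; assumes lowercase text with straight apostrophes."""
    for pattern, replacement in CONTRACTION_RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand_contractions("i won't go because they're late and i can't wait"))
```

The order of the rules matters more than their number: specific patterns go first so that the generic ones only handle what is left.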


### **3. remove tags, punctuations, numbers**

Removes unneeded tags, punctuation, and digits, leaving cleaner text that is easier to process.


In [9]:
import re

def preprocess_text(text):
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)
    return text

text = "<p>T,e,a,c,h,e,r, ,i,s, hand123123so4567me</p>"
processed_text = preprocess_text(text)
print(processed_text)

Teacher is handsome


### **4. tokenization**

Splits a piece of text into smaller units (tokens), such as words. Smaller units are easier for a machine to interpret and process.

In [10]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

text = "I really want to enter this class to learn machine learning"
print(word_tokenize(text))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['I', 'really', 'want', 'to', 'enter', 'this', 'class', 'to', 'learn', 'machine', 'learning']


### **5. stopword removal**

Removes common words that carry little meaning on their own, such as "is" and "the". This can reduce bias and lower the computational cost.


In [15]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

text = "I am handsome"

tokens = word_tokenize(text)

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original tokens:", tokens)
print("Filtered tokens:", filtered_tokens)

Original tokens: ['I', 'am', 'handsome']
Filtered tokens: ['handsome']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **6. lemmatization**

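Converts a word back to its base (dictionary) form, for example "ate" to "eat". This reduces the variety of word forms in the text, making it more standardized and consistent, easier to analyze, and cheaper for model training and computation.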

In [16]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cards", "n"))
print(wnl.lemmatize("flying", "v"))
print(wnl.lemmatize("happiest", "a"))

card
fly
happy


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### **7. stemming**

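A word can take different forms depending on context, such as tense or singular/plural. Stemming reduces a word to its stem, which cuts down the number of word forms and the vocabulary size and can improve the accuracy of text analysis.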


In [17]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

text = "I am running to the class for the password card."

tokens = word_tokenize(text)

stemmed_tokens = [stemmer.stem(word) for word in tokens]

print("Original tokens:", tokens)
print("Stemmed tokens:", stemmed_tokens)

Original tokens: ['I', 'am', 'running', 'to', 'the', 'class', 'for', 'the', 'password', 'card', '.']
Stemmed tokens: ['i', 'am', 'run', 'to', 'the', 'class', 'for', 'the', 'password', 'card', '.']


## **referred resources**

[Removing stop words](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)

[Stemming and Lemmatization](https://medium.com/programming-with-data/12-%E8%A9%9E%E5%B9%B9-%E8%A9%9E%E6%A2%9D%E6%8F%90%E5%8F%96-stemming-and-lemmatization-6e146bcc48b)



## 1-2. Please apply text preprocessing to the following sample text provided below. (70%)

In [18]:
documents = ["Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as 'Jumbo'",
        "I WAS VISITING MY FRIEND NATE THE OTHER MORNING FOR COFFEE , HE CAME OUT OF HIS STORAGE ROOM WITH ( A PACKET OF McCANNS INSTANT IRISH OATMEAL .) HE SUGGESTED THAT I TRY IT FOR MY OWN USE ,IN MY STASH . SOMETIMES NATE DOSE NOT GIVE YOU A CHANCE TO SAY NO , SO I ENDED UP TRYING THE APPLE AND CINN . FOUND IT TO BE VERY TASTEFULL WHEN MADE WITH WATER OR POWDERED MILK . IT GOES GOOD WITH O.J. AND COFFEE AND A SLICE OF TOAST AND YOUR READY TO TAKE ON THE WORLD...OR THE DAY AT LEAST..  JERRY REITH...",
        "I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!",
        "Product received is as advertised.<br /><br /><a href='http://www.amazon.com/gp/product/B001GVISJM'>Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>",
        "this was sooooo deliscious but too bad i ate em too fast and gained 2 pds! my fault",
        "Deal was awesome!  Arrived before Halloween as indicated and was enough to satisfy trick or treaters.  I love the quality of this product and it was much less expensive than the local store's candy.",
        "I love these.........very tasty!!!!!!!!!!!  Infact, I think I am addicted to them.<br />Buying them in packs of 6 bags - is very reasonable than going to Target and getting a bag.  Savings are about a $1.00 a bag.  I use subscribe and save on these and several other product.  I love subscribe and save!!!!!!!!!!!",
        "I LOVE spicy ramen, but for whatever reasons this thing burns my stomach badly and the burning sensation doesn't go away for like 3 hours! Not sure if that is healthy or not .... and you can buy this at Walmart for $0.28, way cheaper than Amazon.",
        "Makes a tasty, super easy meal, fast. BUT high in calories.<br /><br />The instructions say to saute the veggies first but I recommend cooking the chicken first. The chicken takes longer to cook and the raw chicken ontop of veggies just makes a slimy mess. I made it with snow peas and carrots only. I dont like the little corn.  Added some red pepper flakes for heat and served ontop of rice.  It came out wonderful! Dinner on the table in less than 30mins.",
        "Love this sugar.  I also get muscavado sugar and they are both great to use in place of regular white sugar. Recommend!",
        "This is just Fantastic Chicken Noodle soup, the best I have ever eaten, with large hearty chunks of chicken,and vegetables and nice large noodles. This soup is just so full bodied, and is seasoned just right.  I am so glad Amazon carries this product.  I just can't find it here in Vermont."]

## Import

In [19]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

nltk.download("stopwords")
nltk.download("punkt")
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # used by preprocess() below

df= pd.DataFrame(documents,columns=['sentence'])
df

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | sentence |
| --- | --- |
| 0 | Product arrived labeled as Jumbo Salted Peanut... |
| 1 | I WAS VISITING MY FRIEND NATE THE OTHER MORNIN... |
| 2 | I don't know if it's the cactus or the tequila... |
| 3 | `Product received is as advertised.<br /><br />...` |
| 4 | this was sooooo deliscious but too bad i ate e... |
| 5 | Deal was awesome! Arrived before Halloween as... |
| 6 | I love these.........very tasty!!!!!!!!!!! In... |
| 7 | I LOVE spicy ramen, but for whatever reasons t... |
| 8 | Makes a tasty, super easy meal, fast. BUT high... |
| 9 | Love this sugar. I also get muscavado sugar a... |


## Text preprocessing

In [20]:
def decontracting(phrase):
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    return phrase

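def preprocess_sentence(text):
    # Remove HTML tags, punctuation, and digits.
    # Note: matches are replaced with '' (not a space), so words separated
    # only by a tag or punctuation are joined, e.g. "advertisedtwizzl".
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)

    # Lowercase
    text = text.lower()

    # Decontracting. Note: apostrophes were already stripped above, so these
    # patterns no longer match here (hence "dont" in the output below).
    text = decontracting(text)
    return text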

def preprocess(sentence):
    sentence = preprocess_sentence(sentence)

    #tokenization
    tokens = word_tokenize(sentence)

    #stopword removal
    tokens = [word for word in tokens if word not in stop_words]

    #lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    #stemming
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens

df['sentence'] = df['sentence'].apply(preprocess)

df

| | sentence |
| --- | --- |
| 0 | [product, arriv, label, jumbo, salt, peanutsth... |
| 1 | [visit, friend, nate, morn, coffe, came, stora... |
| 2 | [dont, know, cactu, tequila, uniqu, combin, in... |
| 3 | [product, receiv, advertisedtwizzl, strawberri... |
| 4 | [sooooo, delisci, bad, ate, em, fast, gain, pd... |
| 5 | [deal, awesom, arriv, halloween, indic, enough... |
| 6 | [love, theseveri, tasti, infact, think, addict... |
| 7 | [love, spici, ramen, whatev, reason, thing, bu... |
| 8 | [make, tasti, super, easi, meal, fast, high, c... |
| 9 | [love, sugar, also, get, muscavado, sugar, gre... |
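
Tokens such as `dont`, `peanutsth`, and `advertisedtwizzl` in the output come from the order of the cleaning steps: punctuation is stripped before contractions are expanded, and tags and punctuation are replaced with an empty string. A minimal sketch of an alternative ordering follows, using only `re`. The names `expand_negations` and `clean_sentence` are hypothetical, and this is one possible arrangement under those assumptions, not the pipeline that produced the table.

```python
import re

def expand_negations(text):
    # Same two rules as decontracting() in the cell above
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    return text

def clean_sentence(text):
    text = text.lower()
    text = expand_negations(text)          # run while apostrophes still exist
    text = re.sub(r"<[^>]+>", " ", text)   # tags become a space, not ''
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes a space
    text = re.sub(r"\d+", " ", text)       # digits become a space
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_sentence("Product received is as advertised.<br /><br />I don't mind."))
```

Replacing with a space and collapsing whitespace afterwards keeps neighbouring words apart, so tokenization sees `advertised` and the following word as two tokens.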


# Question 2: DataFrame handling
***For Question 2, you may NOT import any additional libraries/packages to solve the questions.***



*   Please download the datasets from the following link, https://www.kaggle.com/aaron7sun/stocknews
*   Save the downloaded files on your own drive and load the file for later use.
*   There are two channels of data provided in this dataset:

  - **News data:** Crawled historical news headlines from Reddit WorldNews Channel. They are ranked by Reddit users' votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)

  - **Stock data:** Dow Jones Industrial Average (DJIA) is used to "prove the concept". (Range: 2008-08-08 to 2016-07-01)



In [21]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [22]:
import pandas as pd
corpus_root = 'drive/My Drive/Colab Notebooks/datasets/'

In [23]:
news_df = pd.read_csv(corpus_root+'RedditNews.csv')
stock_df = pd.read_csv(corpus_root+'upload_DJIA_table.csv')

In [24]:
news_df.head()

| | Date | News |
| --- | --- | --- |
| 0 | 2016-07-01 | A 117-year-old woman in Mexico City finally re... |
| 1 | 2016-07-01 | IMF chief backs Athens as permanent Olympic host |
| 2 | 2016-07-01 | The president of France says if Brexit won, so... |
| 3 | 2016-07-01 | British Man Who Must Give Police 24 Hours' Not... |
| 4 | 2016-07-01 | 100+ Nobel laureates urge Greenpeace to stop o... |


In [25]:
stock_df.head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016-07-01 | 17924.240234 | 18002.380859 | 17916.910156 | 17949.369141 | 82160000 | 17949.369141 |
| 1 | 2016-06-30 | 17712.759766 | 17930.609375 | 17711.800781 | 17929.990234 | 133030000 | 17929.990234 |
| 2 | 2016-06-29 | 17456.019531 | 17704.509766 | 17456.019531 | 17694.679688 | 106380000 | 17694.679688 |
| 3 | 2016-06-28 | 17190.509766 | 17409.720703 | 17190.509766 | 17409.720703 | 112190000 | 17409.720703 |
| 4 | 2016-06-27 | 17355.210938 | 17355.210938 | 17063.080078 | 17140.240234 | 138740000 | 17140.240234 |



## 2-1. Define a *function* that determines the weekly (7 days) Stock **Close value** trend. (80%)
- Implement a *function* that determines the weekly (7 days) Stock **Close value** trend using the Stock data and **then record the trend in the News data in a new column "label" with 0, 1 and -1**. For example,
  - On 2016-06-13, the market closed at 17732.480469. Seven days later, 2016-06-20, the market closed **higher** at 17804.869141. In this scenario, all entries corresponding to 2016-06-13 in the News data will be **marked 1** in the "label" column. Dates 2016-06-14 and 2016-06-15 will also be marked 1 because 2016-06-21 and 2016-06-22 closed higher, respectively.
  - On the other hand, 2016-06-17 will be **marked 0** because 2016-06-24 (7 days later) closed **lower**.
  - If a given date does not have a corresponding date for 7 days later, the given date will be **marked -1**.




In [29]:
from datetime import timedelta

def label(stock_data):
    stock_data['Date'] = pd.to_datetime(stock_data['Date'])
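    # Default label: -1 (no trading day exactly 7 calendar days later)
    stock_data['label'] = -1

    # Map each date to its close value for O(1) lookups
    close_values = dict(zip(stock_data['Date'], stock_data['Close']))

    # Compare each day's close with the close 7 calendar days later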
    for i, row in stock_data.iterrows():
        current_date = row['Date']
        target_date = current_date + timedelta(days=7)

        if target_date in close_values:
            if close_values[target_date] > row['Close']:
                stock_data.at[i, 'label'] = 1  # Market higher
            else:
                stock_data.at[i, 'label'] = 0  # Market lower or unchanged

    return stock_data


In [30]:
labeled_stock_data = label(stock_df)

labeled_stock_data.head(30)

| | Date | Open | High | Low | Close | Volume | Adj Close | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016-07-01 | 17924.240234 | 18002.380859 | 17916.910156 | 17949.369141 | 82160000 | 17949.369141 | -1 |
| 1 | 2016-06-30 | 17712.759766 | 17930.609375 | 17711.800781 | 17929.990234 | 133030000 | 17929.990234 | -1 |
| 2 | 2016-06-29 | 17456.019531 | 17704.509766 | 17456.019531 | 17694.679688 | 106380000 | 17694.679688 | -1 |
| 3 | 2016-06-28 | 17190.509766 | 17409.720703 | 17190.509766 | 17409.720703 | 112190000 | 17409.720703 | -1 |
| 4 | 2016-06-27 | 17355.210938 | 17355.210938 | 17063.080078 | 17140.240234 | 138740000 | 17140.240234 | -1 |
| 5 | 2016-06-24 | 17946.630859 | 17946.630859 | 17356.339844 | 17400.75 | 239000000 | 17400.75 | 1 |
| 6 | 2016-06-23 | 17844.109375 | 18011.070312 | 17844.109375 | 18011.070312 | 98070000 | 18011.070312 | 0 |
| 7 | 2016-06-22 | 17832.669922 | 17920.160156 | 17770.359375 | 17780.830078 | 89440000 | 17780.830078 | 0 |
| 8 | 2016-06-21 | 17827.330078 | 17877.839844 | 17799.800781 | 17829.730469 | 85130000 | 17829.730469 | 0 |
| 9 | 2016-06-20 | 17736.869141 | 17946.359375 | 17736.869141 | 17804.869141 | 99380000 | 17804.869141 | 0 |
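
The labeling rule can be checked by hand on a few dates. The sketch below is a standalone illustration that uses only the standard library and the closing values quoted in the question text and the table above. The helper name `weekly_label` and the small `closes` dictionary are assumptions made for the example, not part of the answer's `label` function.

```python
from datetime import date, timedelta

# Closing values copied from the question text and the table above
closes = {
    date(2016, 6, 13): 17732.480469,
    date(2016, 6, 20): 17804.869141,
    date(2016, 6, 24): 17400.75,
    date(2016, 6, 27): 17140.240234,
    date(2016, 7, 1): 17949.369141,
}

def weekly_label(day, closes):
    """1 if the close 7 calendar days later is higher, 0 if not, -1 if that date is missing."""
    target = day + timedelta(days=7)
    if target not in closes:
        return -1
    return 1 if closes[target] > closes[day] else 0

for day in sorted(closes):
    print(day, weekly_label(day, closes))
```

With these five dates, 2016-06-13 and 2016-06-24 are labeled 1, 2016-06-20 is labeled 0, and 2016-06-27 and 2016-07-01 are labeled -1 because no record exists seven days later.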


## 2-2. Map your label to news data (20%)

In [33]:
def news_label(news_df, labeled_stock_data):

    news_df['Date'] = pd.to_datetime(news_df['Date'])
    labeled_stock_data['Date'] = pd.to_datetime(labeled_stock_data['Date'])

    #left merge
    merged_df = pd.merge(news_df, labeled_stock_data[['Date', 'label']], on='Date', how='left')

    # News dates with no stock record get -1; assign the result back instead
    # of calling fillna(inplace=True) on a column selection
    merged_df['label'] = merged_df['label'].fillna(-1).astype(int)

    return merged_df

labeled_news_df = news_label(news_df, labeled_stock_data)
labeled_news_df.head(30)

| | Date | News | label |
| --- | --- | --- | --- |
| 0 | 2016-07-01 | A 117-year-old woman in Mexico City finally re... | -1 |
| 1 | 2016-07-01 | IMF chief backs Athens as permanent Olympic host | -1 |
| 2 | 2016-07-01 | The president of France says if Brexit won, so... | -1 |
| 3 | 2016-07-01 | British Man Who Must Give Police 24 Hours' Not... | -1 |
| 4 | 2016-07-01 | 100+ Nobel laureates urge Greenpeace to stop o... | -1 |
| 5 | 2016-07-01 | Brazil: Huge spike in number of police killing... | -1 |
| 6 | 2016-07-01 | Austria's highest court annuls presidential el... | -1 |
| 7 | 2016-07-01 | Facebook wins privacy case, can track any Belg... | -1 |
| 8 | 2016-07-01 | Switzerland denies Muslim girls citizenship af... | -1 |
| 9 | 2016-07-01 | China kills millions of innocent meditators fo... | -1 |


# Question 3: Compute cosine similarity of TF-IDF (term frequency–inverse document frequency)
***For Question 3, you may NOT import any additional libraries/packages to solve the questions.***
-  **Cosine similarity** is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
- **TF-IDF** is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus








## 3-1. Cosine similarity and Euclidean distance are two commonly used metrics to quantify the similarity between two vectors in a vector space model. Explain the differences and implications when we interpret the similarity scores measured by these two metrics. Use examples in your answers. (20%)

## **Cosine similarity**
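Measures how similar two vectors are in direction, regardless of their magnitude.

*   1: the vectors point in exactly the same direction
*  -1: the vectors point in exactly opposite directions
*   0: the vectors are orthogonal (unrelated in direction)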

$\text{Cosine Similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$

---

## **Euclidean distance**
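Measures the actual straight-line distance between two vectors, so it depends on their magnitudes and not only on the angle between them.

$\text{Euclidean Distance}(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$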

---

## **Example**

*   Vector A: (1,2)
*   Vector B: (2,4)
*   Vector C: (-2,-2)

### Cosine similarity

*   Cosine similarity of A and B: $\frac{(1\times2) + (2\times4)}{\sqrt{(1^2 + 2^2)} \sqrt{(2^2 + 4^2)}} = \frac{2 + 8}{\sqrt{5} \sqrt{20}} = \frac{10}{\sqrt{100}} = 1$


*   Cosine similarity of A and C: $\frac{(1\times-2) + (2\times-2)}{\sqrt{(1^2 + 2^2)} \sqrt{((-2)^2 + (-2)^2)}} = \frac{-2 - 4}{\sqrt{5} \sqrt{8}} = \frac{-6}{\sqrt{40}} \approx -0.949$

A and B point in exactly the same direction; A and C point in nearly opposite directions (but not exactly opposite).

### Euclidean distance

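*   Euclidean distance between A and B: $\sqrt{(2-1)^2 + (4-2)^2} = \sqrt{1 + 4} = \sqrt{5}$


*   Euclidean distance between A and C: $\sqrt{(-2-1)^2 + (-2-2)^2} = \sqrt{9 + 16} = \sqrt{25} = 5$

The distance between A and B is smaller than the distance between A and C, so A and B are relatively closer. Note that A and B have a cosine similarity of 1 yet a non-zero Euclidean distance, because Euclidean distance also reflects the difference in magnitude.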
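
The hand calculations for A, B, and C can be reproduced with a short standard-library sketch. The function names `cosine_similarity` and `euclidean_distance` are chosen here for illustration, and `math.dist` assumes Python 3.8 or later.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between the two points."""
    return math.dist(u, v)

A, B, C = (1, 2), (2, 4), (-2, -2)

for name, other in (("B", B), ("C", C)):
    print(
        f"A vs {name}:",
        "cosine =", round(cosine_similarity(A, other), 3),
        "distance =", round(euclidean_distance(A, other), 3),
    )
```

Scaling B to (20, 40) would leave its cosine similarity with A unchanged while its Euclidean distance grows, which is the practical difference between the two metrics.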

---

## **referred resources**
[歐氏距離與餘弦相似度](https://www.cnblogs.com/HuZihu/p/10178165.html)


## 3-2. Define a converting function to compute tf-idf vector from a list of documents. (40%)

In [34]:
documents = ['terrible service this time','terrible terrible service','most terrible service','terrible service and experience','what a terrible service','so terrible service experience','what a terrible disappointment','what a terrible place','this time it was so horrible','the staff was horrible']

In [35]:
'''
Answer here
you can define other functions to support the defined function if you need.

TF-IDF dataframe show as the following table.
'''
import numpy as np
import math
def computeTFIDF(documents):
    wordSet = set(word for doc in documents for word in doc.split())

    # Term frequency (TF): word count divided by document length, per document
    wordDictList = []
    for doc in documents:
        wordDict = dict.fromkeys(wordSet, 0)
        wordList = doc.split()
        docCount = len(wordList)
        for word in wordList:
            wordDict[word] += 1 / float(docCount)
        wordDictList.append(wordDict)

    # Document frequency per word, then IDF = log10(N / document frequency)
    lens = len(documents)
    idfDict = dict.fromkeys(wordSet, 0)
    for doc in wordDictList:
        for word, count in doc.items():
            if count > 0:
                idfDict[word] += 1

    idfDict = {word: math.log10(lens / float(count)) for word, count in idfDict.items()}

    # TF-IDF
    tfidfList = []
    for doc in wordDictList:
        tfidf = {word: tf * idfDict[word] for word, tf in doc.items()}
        tfidfList.append(tfidf)

    return tfidfList

In [36]:
import pandas as pd
tf_idf_list = computeTFIDF(documents)
df = pd.DataFrame(tf_idf_list)  # reuse the result instead of recomputing it

df

Unnamed: 0,time,most,the,so,disappointment,it,staff,terrible,what,service,was,horrible,experience,and,this,place,a
0,0.174743,0.0,0.0,0.0,0.0,0.0,0.0,0.024228,0.0,0.055462,0.0,0.0,0.0,0.0,0.174743,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064607,0.0,0.07395,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.032303,0.0,0.07395,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024228,0.0,0.055462,0.0,0.0,0.174743,0.25,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024228,0.13072,0.055462,0.0,0.0,0.0,0.0,0.0,0.0,0.13072
5,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.024228,0.0,0.055462,0.0,0.0,0.174743,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.024228,0.13072,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.13072
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024228,0.13072,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.13072
8,0.116495,0.0,0.0,0.116495,0.0,0.166667,0.0,0.0,0.0,0.0,0.116495,0.116495,0.0,0.0,0.116495,0.0,0.0
9,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.174743,0.174743,0.0,0.0,0.0,0.0,0.0
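
As a check on the table, the definitions that `computeTFIDF` implements can be written out explicitly. The symbols $N$ (number of documents) and $\mathrm{df}(t)$ (number of documents containing term $t$) are notation introduced here for the example. This is the unsmoothed base-10 variant used in the code above; other TF-IDF variants give different numbers.

$$\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}, \qquad \mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

For "terrible" in document 0 ("terrible service this time"), the word appears once among 4 words and occurs in 8 of the 10 documents:

$$\text{tf-idf} = \frac{1}{4} \cdot \log_{10}\frac{10}{8} \approx 0.25 \times 0.09691 \approx 0.024228$$

In document 1 ("terrible terrible service") the same word has $\mathrm{tf} = 2/3$, giving $\frac{2}{3} \times 0.09691 \approx 0.064607$. Both values match the "terrible" column of the table.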


## **referred resources**
[TF-IDF](https://medium.com/datamixcontent-lab/%E6%96%87%E6%9C%AC%E5%88%86%E6%9E%90%E5%85%A5%E9%96%80-%E6%A6%82%E5%BF%B5%E7%AF%87-%E7%B5%A6%E6%88%91%E4%B8%80%E6%AE%B5%E8%A9%B1-%E6%88%91%E5%91%8A%E8%A8%B4%E4%BD%A0%E9%87%8D%E9%BB%9E%E5%9C%A8%E5%93%AA-%E5%B0%8D%E6%96%87%E6%9C%AC%E9%87%8D%E9%BB%9E%E5%AD%97%E8%A9%9E%E5%8A%A0%E6%AC%8A%E7%9A%84tf-idf%E6%96%B9%E6%B3%95-f6a2790b4991)

## 3-3. Define a scoring function to compute the cosine similarity between two input vectors. (30%)

In [42]:
'''
Answer here
return the cosine similarity between the given two vectors
Apply the function which you designed to all sentences, and show your scoring results as the following table.
'''
def cosine_sim(vec_a, vec_b):
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    score = dot_product / (norm_a * norm_b)
    return score

cosine_sim([1,2],[3,4])

0.9838699100999074
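
One edge case is worth noting: with an IDF of log10(N / document frequency), a document made up only of words that occur in every document gets an all-zero TF-IDF vector, and `cosine_sim` would then divide zero by zero. None of the ten sentences here triggers this. The sketch below is a hypothetical pure-Python variant named `safe_cosine_sim`; returning 0.0 for a zero vector is an assumption made for illustration, not a standard convention.

```python
import math

def safe_cosine_sim(vec_a, vec_b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a vector with no information is treated as "not similar"
        return 0.0
    return dot / (norm_a * norm_b)

print(safe_cosine_sim([1, 2], [3, 4]))
print(safe_cosine_sim([0, 0], [3, 4]))
```

Whether a zero vector should score 0.0, raise an error, or be filtered out beforehand depends on the downstream task, so the guard value is a design choice to state explicitly.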

## 3-4. Show the cross-comparison table for the given sentences. (10%)

In [40]:
# Use previous df and cosine_sim function

tf_idf_array = df.to_numpy()

similarity_matrix = np.zeros((len(tf_idf_array), len(tf_idf_array)))

for i in range(len(tf_idf_array)):
    for j in range(len(tf_idf_array)):
        similarity_matrix[i, j] = cosine_sim(tf_idf_array[i], tf_idf_array[j])

cosine_similarity_df = pd.DataFrame(similarity_matrix)

cosine_similarity_df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.226813 | 0.055972 | 0.046299 | 0.074014 | 0.056587 | 0.007397 | 0.007397 | 0.517451 | 0.0 |
| 1 | 0.226813 | 1.0 | 0.224349 | 0.185576 | 0.296664 | 0.226813 | 0.051111 | 0.051111 | 0.0 | 0.0 |
| 2 | 0.055972 | 0.224349 | 1.0 | 0.045796 | 0.073209 | 0.055972 | 0.007317 | 0.007317 | 0.0 | 0.0 |
| 3 | 0.046299 | 0.185576 | 0.045796 | 1.0 | 0.060557 | 0.432244 | 0.006053 | 0.006053 | 0.0 | 0.0 |
| 4 | 0.074014 | 0.296664 | 0.073209 | 0.060557 | 1.0 | 0.074014 | 0.57302 | 0.57302 | 0.0 | 0.0 |
| 5 | 0.056587 | 0.226813 | 0.055972 | 0.432244 | 0.074014 | 1.0 | 0.007397 | 0.007397 | 0.258725 | 0.0 |
| 6 | 0.007397 | 0.051111 | 0.007317 | 0.006053 | 0.57302 | 0.007397 | 1.0 | 0.357407 | 0.0 | 0.0 |
| 7 | 0.007397 | 0.051111 | 0.007317 | 0.006053 | 0.57302 | 0.007397 | 0.357407 | 1.0 | 0.0 | 0.0 |
| 8 | 0.517451 | 0.0 | 0.0 | 0.0 | 0.0 | 0.258725 | 0.0 | 0.0 | 1.0 | 0.305206 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.305206 | 1.0 |
