# Embeddings
Word embeddings are numerical representations of words as vectors that capture semantic relationships and contextual information. They are essential in natural language processing (NLP) because they let models operate on text numerically. One common method is Word2Vec, which trains a shallow neural network to map words to dense vectors so that words used in similar contexts end up close together in the vector space. GloVe and FastText are other widely used methods; all three are available as embeddings pre-trained on large text corpora, which can be used as-is or fine-tuned for a specific NLP task. A simpler alternative, and the one implemented in this notebook, is TF-IDF (Term Frequency-Inverse Document Frequency), which weights each word by how often it occurs in a document relative to how many documents in the corpus contain it. Strictly speaking, TF-IDF yields sparse, vocabulary-sized document vectors rather than dense learned word embeddings, but it serves the same purpose of turning text into numbers. Such representations are useful for tasks including sentiment analysis, text classification, and machine translation.
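
To make "close together in the vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three vectors are hand-written toy values chosen for illustration, not the output of Word2Vec or any other trained model, and the name `toy_embeddings` is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-written toy vectors, NOT learned embeddings: values are illustrative only.
toy_embeddings = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

# Related words should score higher than unrelated ones.
print(sim_royal > sim_fruit)  # True
```

With real embeddings the vectors would typically have tens to hundreds of dimensions, but the comparison works the same way.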

In [1]:
# !pip install nltk
import numpy as np
import nltk
from nltk.tokenize import word_tokenize 

# Uncomment on first run: word_tokenize needs the Punkt tokenizer models
# nltk.download('punkt')

In [19]:
#Using the NEWS QA dataset provided in TASK-2C

text=[
    '''LOS ANGELES, California (CNN) -- Fans wishing to attend singer Michael Jackson's memorial service next week will have to register for the 11,000 free tickets, organizers said Thursday. Michael Jackson is shown rehearsing at the Staples Center on June 23, two days before his death. Details on how to register for the 10 a.m. (1 p.m. ET) service at the 20,000-seat Staples Center in Los Angeles, California, Tuesday are to be announced Friday. Jackson's family will hold a private ceremony before the public memorial service, his brother said Thursday. Speaking to CNN's Larry King, Jermaine Jackson said the ceremony will be held Tuesday morning, but he did not say where. Jackson rehearsed at Staples Center two nights before he died, and he appeared healthy in a video clip of the rehearsal obtained by CNN. Jackson died June 25 after collapsing at his rented home in Los Angeles. AEG, promoter of Jackson's planned London, England, shows, released the short video of Jackson rehearsing in the arena on June 23. Jackson sang "They Don't Care About Us," a song from his "HIStory" album, as he danced along with eight male dancers. Watch Jackson rehearse » Jackson did not specify where he wished to be buried in a 2002 will, which was filed in court Wednesday. Watch CNN's Anderson Cooper talk about his interview with AEG » More information emerged Thursday about how Jackson's estate will be shared, which his will estimated in 2002 as being worth $500 million. The family trust created by Jackson to receive all of his assets includes his mother, his children and a list of charities, according to a person with direct knowledge of the contents of the trust. Mother Katherine Jackson's 40 percent share would go to Michael Jackson's three children after her death, the source said. The children -- ages 7, 11 and 12 -- also will share 40 percent of the estate's assets, and the remaining 20 percent will benefit charities designated by the executors of the will, the source said. 
A judge has delayed for a week, until July 13, a hearing to decide whether Katherine Jackson will remain the temporary guardian of Jackson's children. At a brief talk with reporters Thursday, an attorney for Jackson's ex-wife Debbie Rowe said she "has not reached a final decision" on whether she will challenge Jackson's mother for custody of Jackson's two oldest children, according to her lawyer. A Los Angeles TV station quoted Rowe on Thursday morning saying, "I want my children." Except for the statement to the radio station, she has not publicly indicated whether she would seek custody now that Jackson is dead. Rowe was left out of the will. "I have intentionally omitted to provide for my former wife, Deborah Rowe Jackson," the will said. The will nominated Katherine Jackson, now 79, as the guardian of his children. If Katherine Jackson were to die, "I nominate Diana Ross as guardian," Jackson said in the will, written July 7, 2002. Singer Ross, 65, was a lifelong friend of Jackson's. Watch how the two had a close relationship » There's also a question on when the will's executors should take over control of the late entertainer's assets, which Judge Mitchell Beckloff temporarily placed under Katherine Jackson's control. One man named as executor is John Branca, who represented Jackson from 1980 until 2006 and was hired again before the singer's death. He helped acquire Jackson's music catalog, which is worth millions. The other is music industry executive John McClain, a longtime Jackson friend who has worked with him and his sister Janet. DEA reportedly joins investigation The Drug Enforcement Administration has joined the investigation into Jackson's death, a federal law enforcement official said Wednesday night. And the California State Attorney General's office said Thursday that it is helping the Los Angeles Police Department in its investigation. The attorney general's office said it will assist police in sifting through information in a'''

    ,'''(CNN) -- Top Republican lawmakers Sunday called on President Obama to change his political strategy, arguing that the passage of a massive stimulus bill on a party-line vote showed he has failed to deliver the "change" he promised. Sen. John McCain says the Obama administration is off to a "bad beginning." "If this is going to be bipartisanship, the country's screwed," Sen. Lindsey Graham, R-South Carolina, told ABC's "This Week." "I know bipartisanship when I see it." Sen. John McCain, R-Arizona, said Obama was off to "a bad beginning," out of step with the vow of bipartisanship both men made after Obama beat out the Republican presidential nominee for the White House in November. "It was a bad beginning because it wasn't what we promised the American people, what President Obama promised the American people, that we would sit down together," McCain told CNN's "State of the Union With John King." The $787 billion bill made it through Congress with the support of three Republicans -- Sens. Susan Collins and Olympia Snowe of Maine and Arlen Specter of Pennsylvania. Obama is expected to sign the bill Tuesday in Denver, Colorado. Watch Democratic and GOP analysts debate bipartisanship » "This is not 'change we can believe in,' " Graham, a member of the Senate Banking Committee, told ABC. He said Democrats "rammed it through the House" after starting out "with the idea, 'We won -- we write the bill.' " But Obama's spokesman insisted the stimulus is a bipartisan success. Speaking to CBS' "Face the Nation," White House spokesman Robert Gibbs said, "We're happy that Congress, in a bipartisan way, took steps to make whatever happens in this recession easier to take for the American people." iReport.com: Share your thoughts on the stimulus plan And on CNN's "State of the Union," Gibbs said, "I think what you saw from this president was an unprecedented effort to reach out to Republicans. 
Not just in meetings at the White House, but you had the president drive up to Capitol Hill to meet with Republicans where they work." McCain fired back. "Look, I appreciate the fact that the president came over and talked to Republicans," he said. "That's not how you negotiate a result. You sit down together in a room with competing proposals. Almost all of our proposals went down on a party-line vote." When the next major piece of legislation aimed at helping the economy recover reaches Congress, McCain said that he hopes "we will sit down together and conduct truly bipartisan negotiations. This was not a bipartisan bill." iReport.com: McCain's actions "totally reprehensible" McCain added, "Republicans were guilty of this kind of behavior. I'm not saying that we did things different. But Americans want us to do things differently, and they want us to work together." Gibbs described things differently. "This president has always worked in a bipartisan fashion," he told King. "He will continue to reach out to Republicans. John, we hope that Republicans will decide they want to reach back."'''
    
    ,'''COLOMBO, Sri Lanka (CNN) -- Sri Lankan soldiers have seized a key rebel stronghold after launching a surprise attack early Sunday morning, the head of Sri Lanka's army announced. Sri Lankan army chief Sarath Fonseka says a key Tamil town has been taken in a national TV broadcast Sunday. Troops crossed a lagoon and entered the town of Mullaittivu before encountering heavy resistance from Tamil fighters, according to the government-run news agency. "Our troops fought their way through a 40 km (25 mile) thick jungle track," Lt. Gen. Sarath Fonseka said in a televised address on Sunday. "This is the long awaited victory and I am happy to say that our heroic forces today captured the Mullaittivu town after 12 years," the Sri Lanka Army chief said. There is no confirmation from the rebels that the strategic garrison has been overtaken. The Liberation Tigers of Tamil Eelam (LTTE) -- commonly known as the Tamil Tigers -- have fought for an independent homeland for the country's ethnic Tamil minority since 1983. The civil war has left more than 70,000 people dead. The rebels gained control over Mullaittivu in 1996 and established a military garrison there, according to the government. In recent days, the military has made significant progress in its campaign to recapture rebel strongholds. Earlier this month, troops regained control of the northern town of Elephant Pass, the point at which mainland Sri Lanka links to the northern Jaffna peninsula. It had been in rebel hands for more than nine years. The re-capture enabled the government to use a highway linking the mainland to the peninsula to move troops and supplies. Previously, it was done by air and sea. "The area that the LTTE has dominated has shrank phenomenally," Sri Lankan High Commissioner to India, C.R Jayasinghe, told CNN. "They lost... about 90 percent of what they had." Despite major government gains, critics point to ongoing civilian casualties resultant from the conflict. 
"This is an important strategic success for the army, but literally tens of thousands of people, children, are in the line of fire," United Nations spokesman James Elder said in a phone conversation Sunday. "Some Sri Lankan U.N. staff are trapped there," he added. "Convoys are going to the area, delivering emergency supplies, but these are not sufficient for the number of people in need." Sri Lankan authorities are barring journalists and humanitarian aid workers from areas where heavy fighting is taking place. Amnesty International spokesman Shuransu Mishra estimated that "over a quarter of a million of the population, mostly Tamils, are trapped between the two sides." The organization says greater access and protection for aid workers and journalists are needed as news agencies struggle to report an accurate picture of the conflict. "The Sri Lankan authorities are doing little to ensure the safety of the country's media, or to prosecute those responsible for murdering or attacking them," Amnesty International spokeswoman Yolanda Foster said in a written statement on Friday. "They (Sri Lankan authorities) are also directly responsible for subjecting journalists to harassment and interrogation," she said. At least 14 journalists have been killed since the start of 2006, according to the statement. Others have been driven from the country by death threats, or in fear of detention and torture by government authorities, it said.'''
    
    ,'''(CNN) -- Icelandic Prime Minister Johanna Sigurdardottir of the center-left Social Democratic Alliance has claimed victory in general elections triggered by the collapse of the Nordic nation's economy. Sigurdardottir celebrates victory on Saturday night. Sigurdardottir's party, which has headed an interim government since February 1, was on course to win around 30 percent of the vote or 20 parliamentary seats, according to state broadcaster RUV. The Left-Green Movement, the Social Democratic Alliance's coalition ally, was expected to win 14 seats, giving the coalition a controlling 34-seat block in the 63-member Icelandic parliament, the Althing. "I believe this will be our big victory," Sigurdardottir told supporters, according to Reuters.com. "I am touched, proud and humble at this moment when we are experiencing this great, historic victory of the social democratic movement." Sigurdardottir's electoral success marks a change of direction for Iceland, a nation 300,000 people, which has traditionally leaned to the right on political matters. Sigurdardottir, the world's first openly gay leader and Iceland's first female premier, has pledged to take the Atlantic island into the European Union and to join the euro common currency as a viable way to rescue Iceland's suffering economy. But that ambition could bring Sigurdardottir into conflict with the Left-Green Movement which favors a currency union with Norway as an alternative to EU membership. Iceland has been in political turmoil since October, when its currency, stock market and leading banks crashed amid the global financial crisis. The country's Nordic neighbors sent billions of dollars to prop up the economy, as did the International Monetary Fund in its first intervention to support a Western European democracy in decades. 
But weekly demonstrations -- some verging on riots -- finally forced Prime Minister Geir Haarde and his Independence Party-led center-right coalition to resign en masse on January 26. The Independence Party was projected to win 16 seats in Saturday's vote, according to RUV.'''
]

In [20]:
sentences = []
word_set = []

In [21]:
for sent in text:
    # Lowercase each document and keep only purely alphabetic tokens
    tokens = [tok.lower() for tok in word_tokenize(sent) if tok.isalpha()]
    sentences.append(tokens)
    word_set.extend(tokens)

# Deduplicate to obtain the vocabulary
word_set = set(word_set)
# Total number of documents in the corpus
total_documents = len(sentences)

# Map each vocabulary word to a column index
index_dict = {word: i for i, word in enumerate(word_set)}
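
As a rough illustration of what the preprocessing step keeps and discards, the sketch below applies the same `isalpha()` filter and lowercasing to a short phrase from the first article. It uses plain `str.split()` as a stand-in so it runs without NLTK data; that is an assumption of this sketch, and `word_tokenize` splits punctuation differently. The name `vocab_index` and the use of `sorted` (for a reproducible order) are also choices of this sketch, not of the notebook, which iterates over an unordered set.

```python
# Stand-in for nltk's word_tokenize so the sketch runs without downloads;
# whitespace splitting does NOT separate punctuation the way word_tokenize does.
sample = "Details on the 10 a.m. service at the 20,000-seat Staples Center"
tokens = sample.split()

# Same filter as the notebook: keep purely alphabetic tokens, lowercased.
kept = [tok.lower() for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if not tok.isalpha()]

# Assign each distinct word a column index, as index_dict does.
vocab_index = {word: i for i, word in enumerate(sorted(set(kept)))}

print(kept)
print(dropped)      # ['10', 'a.m.', '20,000-seat']
print(vocab_index)
```

Numbers and tokens containing punctuation never reach the vocabulary, which is why the TF-IDF columns in this notebook contain only plain words.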

In [22]:
# Document frequency: the number of documents that contain each vocabulary word

def count_dict(sentences):
    doc_sets = [set(sent) for sent in sentences]  # sets make membership tests fast
    return {word: sum(word in doc for doc in doc_sets) for word in word_set}

word_count = count_dict(sentences)
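
Despite its name, `word_count` holds document frequencies, not total occurrence counts. The following sketch, on a small made-up corpus (the `docs` list is hypothetical), shows the difference between the two quantities using `collections.Counter`.

```python
from collections import Counter

# Hypothetical tokenised documents, for illustration only.
docs = [
    ["jackson", "will", "jackson", "estate"],
    ["obama", "bill", "will"],
    ["army", "town"],
]

# Document frequency: each document contributes at most 1 per word.
doc_freq = Counter(word for doc in docs for word in set(doc))
# Raw corpus count: every occurrence is counted.
raw_count = Counter(word for doc in docs for word in doc)

print(doc_freq["jackson"], raw_count["jackson"])  # 1 2
print(doc_freq["will"], raw_count["will"])        # 2 2
```

Only the document frequency enters the IDF term; repeated use of a word inside one document is captured by the term frequency instead.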

In [23]:
# Term frequency: share of the document's tokens that equal `word`
def termfreq(document, word):
    return document.count(word) / len(document)

In [24]:
# Inverse document frequency (natural log)

def inverse_doc_freq(word):
    # Words outside the vocabulary are treated as if they occurred in one document
    word_occurrence = word_count.get(word, 1)
    return np.log(total_documents / word_occurrence)

In [25]:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        if word not in index_dict:
            # Out-of-vocabulary words have no column, so skip them
            continue
        tf_idf_vec[index_dict[word]] = termfreq(sentence, word) * inverse_doc_freq(word)
    return tf_idf_vec
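
As a reading aid, the three functions `termfreq`, `inverse_doc_freq`, and `tf_idf` can be summarised by the formulas below. The symbols are notation introduced for this summary, not names from the code: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ corresponds to `total_documents`, and $\mathrm{df}(t)$ corresponds to `word_count[t]`. The logarithm is natural because `np.log` is used.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
```

This is the unsmoothed variant; other libraries may add smoothing terms or normalise the resulting vectors, so their values need not match the ones printed in this notebook.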

In [26]:
# TF-IDF encode every document in the corpus
vectors = [tf_idf(sent) for sent in sentences]

print(vectors)

[array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00222519, 0.        , 0.00092354, 0.        , 0.        ,
       0.00222519, 0.        , 0.00222519, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00667557,
       0.00222519, 0.        , 0.        , 0.00222519, 0.00667557,
       0.00445038, 0.        , 0.        , 0.        , 0.        ,
       0.00667557, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00222519, 0.        , 0.00222519, 0.00222519,
       0.        , 0.        , 0.        , 0.00222519, 0.        ,
       0.0011126 , 0.        , 0.00222519, 0.00778817, 0.        ,
       0.        , 0.0011126 , 0.        , 0.        , 0.        ,
       0.00222519, 0.00222519, 0.00445038, 0.00046177, 0.        ,
       0.        , 0.        , 0.00222519, 0.        , 0.0011126 ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00445038, 0.00445038, 0.00222519, 0.        , 0.    

In [46]:
import pandas as pd

# One row per document, one column per vocabulary word (columns follow index_dict order)
pairs = pd.DataFrame(vectors, columns=list(index_dict))
pairs

Unnamed: 0,headed,been,bill,things,minister,details,jayasinghe,after,saturday,commonly,...,created,forced,because,actions,starting,years,says,room,how,join
0,0.0,0.0,0.0,0.0,0.0,0.002225,0.0,0.000924,0.0,0.0,...,0.002225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003338,0.0
1,0.0,0.0,0.014321,0.008593,0.0,0.0,0.0,0.001189,0.0,0.0,...,0.0,0.0,0.002864,0.002864,0.002864,0.0,0.001432,0.002864,0.001432,0.0
2,0.0,0.006551,0.0,0.0,0.0,0.0,0.002621,0.001088,0.0,0.002621,...,0.0,0.0,0.0,0.0,0.0,0.005241,0.002621,0.0,0.0,0.0
3,0.004683,0.002342,0.0,0.0,0.009367,0.0,0.0,0.0,0.009367,0.0,...,0.0,0.004683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004683


In [56]:
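# Columns that are zero in every row: these words occur in all documents, so their IDF is log(1) = 0
stop_words = pairs.columns[(pairs == 0.0).all()]
stop_words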

Index(['that', 'has', 'to', 'and', 'i', 'in', 'for', 'an', 'a', 'cnn', 'but',
       'at', 'of', 'was', 'the', 'on'],
      dtype='object')
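
The all-zero columns follow directly from the IDF term: a word found in every document has a document frequency equal to the corpus size, so its IDF is the logarithm of 1. The sketch below reproduces this on a made-up three-document corpus (the `docs` list and the helper `idf` are hypothetical and only mirror the form of `inverse_doc_freq`).

```python
import math

# Hypothetical tokenised corpus, for illustration only.
docs = [
    ["the", "singer", "died"],
    ["the", "bill", "passed"],
    ["the", "army", "advanced"],
]
n_docs = len(docs)

def idf(word):
    """ln(N / df), the same form as inverse_doc_freq in the notebook."""
    df = sum(word in doc for doc in docs)
    return math.log(n_docs / df)

print(idf("the"))       # 0.0, because "the" is in all 3 documents
print(idf("bill") > 0)  # True, because "bill" is in only 1 document
```

Note that with only four articles, this criterion flags any word shared by all of them, which is why a corpus-specific token such as 'cnn' appears in the list alongside conventional stop words.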

In [59]:
matrix=pairs.drop(columns=stop_words)
matrix

Unnamed: 0,headed,been,bill,things,minister,details,jayasinghe,after,saturday,commonly,...,created,forced,because,actions,starting,years,says,room,how,join
0,0.0,0.0,0.0,0.0,0.0,0.002225,0.0,0.000924,0.0,0.0,...,0.002225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003338,0.0
1,0.0,0.0,0.014321,0.008593,0.0,0.0,0.0,0.001189,0.0,0.0,...,0.0,0.0,0.002864,0.002864,0.002864,0.0,0.001432,0.002864,0.001432,0.0
2,0.0,0.006551,0.0,0.0,0.0,0.0,0.002621,0.001088,0.0,0.002621,...,0.0,0.0,0.0,0.0,0.0,0.005241,0.002621,0.0,0.0,0.0
3,0.004683,0.002342,0.0,0.0,0.009367,0.0,0.0,0.0,0.009367,0.0,...,0.0,0.004683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004683


In [58]:
test=['have a great day']