#### 1. Import nltk and download the ‘stopwords’ and ‘punkt’ packages
Difficulty Level : L1

Q. Import nltk and necessary packages

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

#### 2. Import spacy and load the language model
Difficulty Level : L1

Q. Import the spaCy library and load the ‘en_core_web_sm’ model for English. Load ‘xx_ent_wiki_sm’ for multi-language support.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp

#### 3. How to tokenize a given text?
Difficulty Level : L1

Q. Print the tokens of the given text document

In [None]:
text="Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided."

Desired Output :
```
Last
week
,
the
University
of
Cambridge
shared
...(truncated)...
```

In [None]:
# Method 1
# Tokenization with nltk
import nltk
tokens=nltk.word_tokenize(text)
for token in tokens:
  print(token)

In [None]:
# Method 2
# Tokenization with spaCy
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)
for token in doc:
  print(token.text)
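
For intuition about what a tokenizer does, the following is a minimal regex-based sketch that uses only the standard library. It is not how nltk or spaCy tokenize internally: the pattern and the helper name `simple_tokenize` are assumptions made for illustration, and contractions or abbreviations will be split more crudely than by either library.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Last week, the University of Cambridge shared its own research."
tokens = simple_tokenize(sample)
for token in tokens:
    print(token)
```

Punctuation marks come out as separate tokens, which matches the shape of the desired output above.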

#### 4. How to get the sentences of a text document?
Difficulty Level : L1

Q. Print the sentences of the given text document

In [None]:
text="""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

Desired Output :
```
The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.
Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.
Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be)
...(truncated)...
```

In [None]:
# Tokenizing the text into sentences with spaCy
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)
for sentence in doc.sents:
  print(sentence)

In [None]:
# Extracting sentences with nltk
nltk.sent_tokenize(text)
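
To see why a trained sentence tokenizer is worth using, here is a naive splitter built on a regular expression. The helper name `naive_sentences` and the sample text are hypothetical, and the approach is deliberately simplistic: it breaks after every `.`, `!` or `?` that is followed by whitespace.

```python
import re

def naive_sentences(text):
    # Split on whitespace that directly follows ".", "!" or "?"
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Mass media shape how we see the world. Are they still able to unite us? Research (e.g. on radio) continues."
parts = naive_sentences(sample)
for sentence in parts:
    print(sentence)
```

The abbreviation "e.g." is wrongly treated as a sentence end here, which is the kind of case the nltk and spaCy sentence segmenters are designed to handle better.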

#### 5. How to tokenize a text using the `transformers` package?
Difficulty Level : L1

Q. Tokenize the given text in encoded form using the tokenizer from Hugging Face’s transformers package.

In [None]:
text="I love spring season. I go hiking with my friends"

Desired Output :
```
[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]
[CLS] i love spring season. i go hiking with my friends [SEP]
```

In [None]:
# Import tokenizer from transformers
from transformers import AutoTokenizer
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Initialize the tokenizer
tokenizer=AutoTokenizer.from_pretrained('bert-base-uncased')

# Encoding with the tokenizer
inputs=tokenizer.encode(text)
print(inputs)
print(tokenizer.decode(inputs))
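
BERT-style tokenizers split words that are missing from the vocabulary into subword pieces. The sketch below illustrates the greedy longest-match-first idea behind WordPiece with a tiny, made-up vocabulary. The function name `wordpiece` and `toy_vocab` are assumptions for illustration only; the real `bert-base-uncased` vocabulary is far larger and will split words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest vocabulary entry
    # that matches the start of the remaining characters.
    # Pieces that do not start the word carry a "##" prefix.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word maps to the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"hik", "##ing", "spring", "love"}  # hypothetical vocabulary
print(wordpiece("hiking", toy_vocab))
print(wordpiece("spring", toy_vocab))
print(wordpiece("friends", toy_vocab))
```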

#### 6. How to tokenize text with stopwords as delimiters?
Difficulty Level : L2

Q. Tokenize the given text using the stop words (“is”, “the”, “was”) as delimiters. Tokenizing this way identifies meaningful phrases, which is sometimes useful for topic modeling.

In [None]:
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

Expected Output :
```
['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'best person I know']
```


In [None]:
# Solution
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

stop_words_and_delims = ['was', 'is', 'the', '.', ',', '-', '!', '?']
for r in stop_words_and_delims:
    text = text.replace(r, 'DELIM')

words = [t.strip() for t in text.split('DELIM')]
words_filtered = list(filter(lambda a: a not in [''], words))
words_filtered
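
The `str.replace` approach works on this text, but it would also cut a stop word out of the middle of a longer word (for example the "is" inside "this"). A possible alternative, sketched below, uses a regular expression with word boundaries. The helper name `split_on_stopwords` and its `punctuation` parameter are assumptions introduced for this example.

```python
import re

def split_on_stopwords(text, stop_words, punctuation=".,-!?"):
    # \b keeps a stop word from matching inside a longer word (e.g. "is" in "this")
    words = "|".join(rf"\b{re.escape(w)}\b" for w in stop_words)
    pattern = rf"(?:{words}|[{re.escape(punctuation)}])"
    # Split on the delimiters, then drop empty or whitespace-only pieces
    return [p.strip() for p in re.split(pattern, text) if p.strip()]

sample = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
phrases = split_on_stopwords(sample, ["was", "is", "the"])
print(phrases)
```

On the exercise text this gives the same phrases as the expected output, while a string such as "this is fine" keeps "this" intact.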

#### 7. How to remove stop words in a text?
Difficulty Level : L1

Q. Remove all the stopwords ( ‘a’ , ‘the’, ‘was’…) from the text

In [None]:
text="""the outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication?"""

Desired Output :
```
'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . rate contagion patterns transmission threatens sense agency , safety measures place contain spread virus require social distancing refraining inherently human , find solace company . context physical threat , social physical distancing , public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g. radio , movies , television , internet , mobiles ) zeitgeist ( e.g. cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . media ( broadcast digital ) able convey sense unity reaching large audiences , messages lost noisy crowd mass self - communication ?'
```

In [None]:
# Method 1
# Removing stopwords in nltk

import nltk
from nltk.corpus import stopwords
my_stopwords=set(stopwords.words('english'))
new_tokens=[]

# Tokenization using word_tokenize()
all_tokens=nltk.word_tokenize(text)

# NLTK's stopword list is lowercase, so compare in lowercase
for token in all_tokens:
  if token.lower() not in my_stopwords:
    new_tokens.append(token)


" ".join(new_tokens)

In [None]:
# Method 2
# Removing stopwords in spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
doc=nlp(text)
new_tokens=[]

# Using is_stop attribute of each token to check if it's a stopword
for token in doc:
  if token.is_stop==False:
    new_tokens.append(token.text)

" ".join(new_tokens)

#### 8. How to add custom stop words in spaCy ?
Difficulty Level : L1

Q. Add the custom stopwords “NIL” and “JUNK” in spaCy and remove the stopwords from the text below

In [None]:
text=" Jonas was a JUNK great guy NIL Adam was evil NIL Martha JUNK was more of a fool "

Expected Output :
```
'Jonas great guy Adam evil Martha fool'
```

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

# list of custom stop words
customize_stop_words = ['NIL','JUNK']

# Adding these stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
doc = nlp(text.strip())
tokens = [token.text for token in doc if not token.is_stop]

" ".join(tokens)

#### 9. How to remove punctuations ?
Difficulty Level : L1

Q. Remove all punctuation from the given text

In [None]:
text="The match has concluded !!! India has won the match . Will we fin the finals too ? !"

Desired Output :
```
'The match has concluded India has won the match Will we fin the finals too'
```

In [None]:
# Method 1
# Removing punctuations in spaCy
import spacy
nlp = spacy.load("en_core_web_sm")

doc=nlp(text)
new_tokens=[]
# Check if a token is a punctuation through is_punct attribute
for token in doc:
  if token.is_punct==False:
    new_tokens.append(token.text)

" ".join(new_tokens)

In [None]:
# Method 2
# Removing punctuation in nltk with RegexpTokenizer
import nltk
tokenizer=nltk.RegexpTokenizer(r"\w+")

tokens=tokenizer.tokenize(text)
" ".join(tokens)

#### 10. How to perform stemming
Difficulty Level : L2

Q. Perform stemming, i.e. convert each token in the given text to its root form

In [None]:
text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

Desired Output:
```
'danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .'
```

In [None]:
# Stemming with nltk's PorterStemmer
import nltk
from nltk.stem import PorterStemmer

stemmer=PorterStemmer()
stemmed_tokens=[]
for token in nltk.word_tokenize(text):
  stemmed_tokens.append(stemmer.stem(token))

" ".join(stemmed_tokens)

#### 11. How to lemmatize a given text ?
Difficulty Level : L2

Q. Perform lemmatization on the given text


Hint: Lemmatization Approaches

In [None]:
text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

Desired Output:
```
'dancing be an art . student should be teach dance as a subject in school . -PRON- dance in many of -PRON- school function . some people be always hesitate to dance .'
```

In [None]:
# Lemmatization using spaCy's lemma_ attribute of each token
# Note: the '-PRON-' placeholder is spaCy v2 behaviour; spaCy v3 returns the pronoun itself
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

lemmatized=[token.lemma_ for token in doc]
" ".join(lemmatized)

#### 12. How to extract usernames from emails ?
Difficulty Level : L2

Q. Extract the usernames from the email addresses present in the text

In [None]:
text= "The new registrations are potter709@gmail.com , elixir101@gmail.com. If you find any disruptions, kindly contact granger111@gamil.com or severus77@gamil.com "

Desired Output :
```
['potter709', 'elixir101', 'granger111', 'severus77']
```

In [None]:
# Using regular expression to extract usernames
import re  

# \S+ matches one or more non-whitespace characters
# @ matches the literal "@" that separates username and domain
# The parentheses capture only the part before the "@"
usernames = re.findall(r'(\S+)@', text)
print(usernames) 
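
The `\S+` pattern is enough for this text, but it also captures any punctuation glued to the address, such as an opening parenthesis. A stricter pattern is sketched below. `EMAIL_PATTERN` and the sample string are hypothetical, and the character classes are a simplified assumption about what an address looks like, not a full email grammar.

```python
import re

# Username characters, "@", then a domain containing at least one dot (simplified shape)
EMAIL_PATTERN = re.compile(r"\b([A-Za-z0-9._%+-]+)@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

sample = "Write to (user.name@example.org) or admin@example.com."
strict_usernames = EMAIL_PATTERN.findall(sample)
loose_usernames = re.findall(r"(\S+)@", sample)

print(strict_usernames)
print(loose_usernames)
```

The loose pattern returns "(user.name" for the first address, while the stricter one drops the parenthesis.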

#### 13. How to find the most common words in the text excluding stopwords
Difficulty Level : L2

Q. Extract the top 10 most common words in the given text excluding stopwords.

In [None]:
text="""Junkfood - Food that do no good to our body. And there's no need of them in our body but still we willingly eat them because they are great in taste and easy to cook or ready to eat. Junk foods have no or very less nutritional value and irrespective of the way they are marketed, they are not healthy to consume.The only reason of their gaining popularity and increased trend of consumption is 
that they are ready to eat or easy to cook foods. People, of all age groups are moving towards Junkfood as it is hassle free and often ready to grab and eat. Cold drinks, chips, noodles, pizza, burgers, French fries etc. are few examples from the great variety of junk food available in the market.
 Junkfood is the most dangerous food ever but it is pleasure in eating and it gives a great taste in mouth examples of Junkfood are kurkure and chips.. cold rings are also source of junk food... they shud nt be ate in high amounts as it results fatal to our body... it cn be eated in a limited extend ... in research its found tht ths junk foods r very dangerous fr our health
Junkfood is very harmful that is slowly eating away the health of the present generation. The term itself denotes how dangerous it is for our bodies. Most importantly, it tastes so good that people consume it on a daily basis. However, not much awareness is spread about the harmful effects of Junkfood .
The problem is more serious than you think. Various studies show that Junkfood impacts our health negatively. They contain higher levels of calories, fats, and sugar. On the contrary, they have very low amounts of healthy nutrients and lack dietary fibers. Parents must discourage their children from consuming junk food because of the ill effects it has on one’s health.
Junkfood is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.
This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure."""

Desired Output:
```
text= {Junkfood: 10,
 food: 8,
 good: 5,
 harmful : 3
 body: 1,
 need: 1,

 ...(truncated)
```

In [None]:
# Creating spacy doc of the text

import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

# Removal of stop words and punctuations
words=[str(token).strip() for token in doc if token.is_stop==False and token.is_punct==False]

freq_dict={}

# Calculating frequency count
for word in words:
  if word not in freq_dict:
    freq_dict[word]=1
  else:
    freq_dict[word]+=1

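print(freq_dict)

# Keeping the 10 most frequent words
top_10=sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)[:10]
print(top_10)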

#### 14. How to do spell correction in a given text ?
Difficulty Level : L2

Q. Correct the spelling errors in the following text

In [None]:
text="He is a gret person. He beleives in bod"

Desired Output:
```
text="He is a great person. He believes in god"
```

In [None]:
# Import textblob
from textblob import TextBlob
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Using textblob's correct() function
text=TextBlob(text)
print(text.correct())
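
As a rough illustration of the idea behind spell correction (pick the known word closest to the misspelling), the sketch below uses `difflib` from the standard library. This is a different and much simpler technique than TextBlob's `correct()`. `VOCABULARY` is a hypothetical mini word list and `correct_word` is a name invented for this example; a usable corrector would need a full dictionary and word frequencies.

```python
import difflib

# Hypothetical mini-vocabulary; a real corrector needs a full word list
VOCABULARY = ["he", "is", "a", "great", "person", "believes", "in", "god"]

def correct_word(word, vocabulary=VOCABULARY):
    # Keep known words; otherwise take the closest vocabulary entry, if any
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word

sample = "He is a gret person He beleives in bod"
corrected = " ".join(correct_word(w) for w in sample.split())
print(corrected)
```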

#### 15. How to tokenize tweets ?
Difficulty Level : L2

Q. Clean the following tweet and tokenize it

In [None]:
text=" Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "

Desired Output :
```
['Having',
 'lots',
 'of',
 'fun',
 'goa',
 'vaction',
 'summervacation',
 'Fancy',
 'dinner',
 'Beachbay',
 'restro']
```

In [None]:
import re
# Cleaning the tweets
text=re.sub(r'[^\w]', ' ', text)

# Using nltk's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tokenizer=TweetTokenizer()
tokenizer.tokenize(text)
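
Cleaning with `[^\w]` discards the information about which tokens were hashtags or mentions. If that distinction matters, one option is to pull them out with separate patterns before cleaning, as in this standard-library sketch. The variable names are illustrative, not part of the exercise.

```python
import re

tweet = " Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "

# Capture the word that follows "#" or "@"
hashtags = re.findall(r"#(\w+)", tweet)
mentions = re.findall(r"@(\w+)", tweet)

# Drop everything that is neither a word character nor whitespace, then split
words = re.sub(r"[^\w\s]", " ", tweet).split()

print(hashtags)
print(mentions)
print(words)
```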

#### 16. How to extract all the nouns in a text?
Difficulty Level : L2

Q. Extract and print all the nouns present in the below text

In [None]:
text="James works at Microsoft. She lives in manchester and likes to play the flute"

Desired Output :
```
James
Microsoft
manchester
flute
```

In [None]:
# Converting the text into a spacy Doc
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

# Using spacy's pos_ attribute to check for part of speech tags
for token in doc:
  if token.pos_=='NOUN' or token.pos_=='PROPN':
    print(token.text)

#### 17. How to extract all the pronouns in a text?
Difficulty Level : L2

Q. Extract and print all the pronouns in the text

In [None]:
text="John is happy finally. He had landed his dream job finally. He told his mom. She was elated "

Desired Output :
```
 He
 He
 She
```

In [None]:
# Using spacy's pos_ attribute to check for part of speech tags
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp(text)

for token in doc:
  if token.pos_=='PRON':
    print(token.text)

#### 18. How to find similarity between two words?
Difficulty Level : L2

Q. Find the similarity between any two words.

In [None]:
word1="amazing"
word2="terrible"
word3="excellent"

Desired Output:
```
similarity between amazing and terrible is 0.46189071343764604
similarity between amazing and excellent is 0.6388207086737778
```

In [None]:
# Convert words into spacy tokens
import spacy
nlp=spacy.load('en_core_web_lg')

token1=nlp(word1)
token2=nlp(word2)
token3=nlp(word3)

# Use similarity() function of tokens
print('similarity between', word1,'and' ,word2, 'is' ,token1.similarity(token2))
print('similarity between', word1,'and' ,word3, 'is' ,token1.similarity(token3))

#### 19. How to find similarity between two documents?
Difficulty Level : L2

Q. Find the similarity between any two text documents

In [None]:
text1="John lives in Canada"
text2="James lives in America, though he's not from there"

Desired Output :
```
0.792817083631068
```

In [None]:
# Method 1
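# Finding similarity using spacy library
# Note: en_core_web_sm ships without static word vectors, so these scores are rough;
# en_core_web_md or en_core_web_lg give more meaningful results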
import spacy
import warnings
warnings.filterwarnings("ignore", category=UserWarning) 

nlp=spacy.load("en_core_web_sm")

doc1=nlp(text1)
doc2=nlp(text2)
print(doc1.similarity(doc2))

In [None]:
# Method 2
from nltk.corpus import wordnet

list1 = text1.split(" ")
list2 = text2.split(" ")

lst = []
for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2:
            s = wordFromList1[0].wup_similarity(wordFromList2[0])
            # wup_similarity can return None when no common ancestor exists
            if s is not None:
                lst.append(s)
            
s = 0
for i in lst:
    s += i

print(1- s/len(lst))

#### 20. How to find the cosine similarity of two documents?
Difficulty Level : L3

Q. Find the cosine similarity between two given documents

In [None]:
text1='Taj Mahal is a tourist place in India'
text2='Great Wall of China is a tourist place in china'

Desired Output :
```
[[1.         0.45584231]
 [0.45584231 1.        ]]
```

In [None]:
# Using Vectorizer of sklearn to get vector representation

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

documents=[text1,text2]

vectorizer=CountVectorizer()
matrix=vectorizer.fit_transform(documents)

# Obtaining the document-word matrix
doc_term_matrix=matrix.todense()
doc_term_matrix

# Computing cosine similarity
df=pd.DataFrame(doc_term_matrix)

print(cosine_similarity(df,df))
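
To make the number in the desired output less of a black box, the cosine similarity can be recomputed by hand with the standard library. The sketch assumes the same tokenization as `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters); the helper names `bag_of_words` and `cosine` are made up for this example.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and keep tokens of two or more word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Dot product over shared terms, divided by the product of the vector lengths
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc1 = bag_of_words("Taj Mahal is a tourist place in India")
doc2 = bag_of_words("Great Wall of China is a tourist place in china")
score = cosine(doc1, doc2)
print(round(score, 8))
```

The two documents share four terms ("is", "tourist", "place", "in"), and "china" counts twice in the second document, so the score is 4 / (√7 · √11).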

#### 21. How to find soft cosine similarity of documents ?
Difficulty Level : L3

Q. Compute the soft cosine similarity of the given documents


Hint: Soft Cosine Similarity

In [None]:
doc_soup = "Soup is a primarily liquid food, generally served warm or hot (but may be cool or cold), that is made by combining ingredients of meat or vegetables with stock, juice, water, or another liquid. "
doc_noodles = "Noodles are a staple food in many cultures. They are made from unleavened dough which is stretched, extruded, or rolled flat and cut into one of a variety of shapes."
doc_dosa = "Dosa is a type of pancake from the Indian subcontinent, made from a fermented batter. It is somewhat similar to a crepe in appearance. Its main ingredients are rice and black gram."
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"

Desired Output :
```
0.5842470477718544
```

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

# Preprocess the sentences
def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

doc_soup = preprocess(doc_soup)
doc_noodles = preprocess(doc_noodles)
doc_dosa = preprocess(doc_dosa)
doc_trump = preprocess(doc_trump)
doc_election = preprocess(doc_election)
doc_putin = preprocess(doc_putin)

# Build a dictionary and a TF-IDF model, convert the sentences to the bag-of-words format
from gensim.corpora import Dictionary
documents = [doc_soup, doc_noodles, doc_dosa, doc_trump, doc_election, doc_putin]
dictionary = Dictionary(documents)

doc_soup = dictionary.doc2bow(doc_soup)
doc_noodles = dictionary.doc2bow(doc_noodles)
doc_dosa = dictionary.doc2bow(doc_dosa)
doc_trump = dictionary.doc2bow(doc_trump)
doc_election = dictionary.doc2bow(doc_election)
doc_putin = dictionary.doc2bow(doc_putin)

from gensim.models import TfidfModel
documents = [doc_soup, doc_noodles, doc_dosa, doc_trump, doc_election, doc_putin]
tfidf = TfidfModel(documents)

doc_soup = tfidf[doc_soup]
doc_noodles = tfidf[doc_noodles]
doc_dosa = tfidf[doc_dosa]
doc_trump = tfidf[doc_trump]
doc_election = tfidf[doc_election]
doc_putin = tfidf[doc_putin]

# Download the FastText model
import gensim.downloader as api
model = api.load('fasttext-wiki-news-subwords-300')
# model = api.load('word2vec-google-news-300')

# Prepare the similarity matrix
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
termsim_index = WordEmbeddingSimilarityIndex(model)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)

# Compute SCM using the inner_product method
similarity = termsim_matrix.inner_product(doc_soup, doc_noodles, normalized=(True, True))
print('similarity = %f' % similarity)

In [None]:
# Compare the soft cosines for all documents against each other
import numpy as np
import pandas as pd

sentences = [doc_soup, doc_noodles, doc_dosa, doc_trump, doc_election, doc_putin]

def create_soft_cossim_matrix(sentences):
    # Entry [i][j] is the soft cosine similarity between document i and document j
    cossim_mat = pd.DataFrame([[round(termsim_matrix.inner_product(a, b, normalized=(True, True)), 2) for b in sentences] for a in sentences])
    return cossim_mat

create_soft_cossim_matrix(sentences)
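
The difference between plain and soft cosine similarity is easier to see on a toy case. The sketch below computes the soft cosine directly from its definition, (aᵀSb) / √((aᵀSa)(bᵀSb)), where S holds term-to-term similarities. The three-word vocabulary and the similarity values are invented for illustration and do not come from the FastText model used above.

```python
import math

def soft_dot(a, b, sim):
    # a^T S b, where sim[i][j] is the similarity between term i and term j
    return sum(sim[i][j] * a[i] * b[j] for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, sim):
    return soft_dot(a, b, sim) / math.sqrt(soft_dot(a, a, sim) * soft_dot(b, b, sim))

# Toy vocabulary: ["soup", "noodles", "election"]; similarity values are made up
term_similarity = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = [1, 0, 0]  # mentions only "soup"
doc_b = [0, 1, 0]  # mentions only "noodles"

plain = soft_cosine(doc_a, doc_b, identity)
soft = soft_cosine(doc_a, doc_b, term_similarity)
print(plain, soft)
```

With the identity matrix the measure reduces to ordinary cosine similarity, which is zero for documents with no shared words. Once related terms get a non-zero similarity, the two documents are no longer treated as unrelated.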

#### 22. How to find similar words using pre-trained Word2Vec?
Difficulty Level : L2

Q. Find all words similar to “amazing” using the Google News Word2Vec model.

Desired Output:
```
[('incredible', 0.90),
('awesome', 0.82),
('unbelievable', 0.82),
('fantastic', 0.77),
('phenomenal', 0.76),
('astounding', 0.73),
('wonderful', 0.72),
('unbelieveable', 0.71),
('remarkable', 0.70),
('marvelous', 0.70)]
```

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Import gensim api
import gensim.downloader as api

# Load the pretrained google news word2vec model
word2vec_model300 = api.load('word2vec-google-news-300')

# Using most_similar() function
word2vec_model300.most_similar('amazing')

#### 23. How to compute Word Mover's Distance?
Difficulty Level : L3

Q. Compute the word mover's distance between the two given texts

In [None]:
sentence_orange = 'Oranges are my favorite fruit'
sent="apples are not my favorite"

Desired Output :
```
5.378
```

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_orange = preprocess(sentence_orange)
sent = preprocess(sent)

# Importing gensim's model
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

# Computing the word mover distance
distance = model.wmdistance(sent, sentence_orange)
print(distance)

#### 24. How to replace all the pronouns in a text with their respective object names
Difficulty Level : L2

Q. Replace the pronouns in the text below with the respective object names

In [None]:
text=" My sister has a dog and she loves him"

Desired Output :
```
[My sister, she]
[a dog, him ]
```

In [None]:
# # NOT WORKING
# # Import neural coref library
# # neuralcoref only works with spacy v2, specifically: spacy==2.1.0, neuralcoref==4.0
# import spacy
# import neuralcoref

# # Add it to the pipeline
# nlp = spacy.load('en')
# neuralcoref.add_to_pipe(nlp)

# # Printing the coreferences
# doc1 = nlp('My sister has a dog. She loves him.')
# print(doc1._.coref_clusters)

In [None]:
import spacy

text = "My sister has a dog and she loves him" 

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

pronouns = []
objects = []
for token in doc:
    if token.pos_ == "PRON" and token.head.pos_ != "PRON":
        pronouns.append(token.text)
        objects.append(token.head.text)
    elif token.pos_ == "PRON" and token.head.pos_ == "PRON":  
        pronouns.append(token.text)
        objects.append(token.head.head.text)

print(pronouns)
print(objects)

#### 25. How to extract topic keywords using LSA?
Difficulty Level : L3

Q. Extract the topic keywords from the given texts using LSA (Latent Semantic Analysis)

In [None]:
texts= ["""It's all about travel. I travel a lot.  those who do not travel read only a page.” – said Saint Augustine. He was a great travel person. Travelling can teach you more than any university course. You learn about the culture of the country you visit. If you talk to locals, you will likely learn about their thinking, habits, traditions and history as well.If you travel, you will not only learn about foreign cultures, but about your own as well. You will notice the cultural differences, and will find out what makes your culture unique. After retrurning from a long journey, you will see your country with new eyes.""",
        """ You can learn a lot about yourself through travelling. You can observe how you feel beeing far from your country. You will find out how you feel about your homeland.You should travel You will realise how you really feel about foreign people. You will find out how much you know/do not know about the world. You will be able to observe how you react in completely new situations. You will test your language, orientational and social skills. You will not be the same person after returning home.During travelling you will meet people that are very different from you. If you travel enough, you will learn to accept and appreciate these differences. Traveling makes you more open and accepting.""",
        """Some of my most cherished memories are from the times when I was travelling. If you travel, you can experience things that you could never experience at home. You may see beautiful places and landscapes that do not exist where you live. You may meet people that will change your life, and your thingking. You may try activities that you have never tried before.Travelling will inevitably make you more independent and confident. You will realise that you can cope with a lot of unexpected situations. You will realise that you can survive without all that help that is always available for you at home. You will likely find out that you are much stronger and braver than you have expected.""",
        """If you travel, you may learn a lot of useful things. These things can be anything from a new recepie, to a new, more effective solution to an ordinary problem or a new way of creating something.Even if you go to a country where they speak the same language as you, you may still learn some new words and expressions that are only used there. If you go to a country where they speak a different language, you will learn even more.""",
        """After arriving home from a long journey, a lot of travellers experience that they are much more motivated than they were before they left. During your trip you may learn things that you will want to try at home as well. You may want to test your new skills and knowledge. Your experiences will give you a lot of energy.During travelling you may experience the craziest, most exciting things, that will eventually become great stories that you can tell others. When you grow old and look back at your life and all your travel experiences, you will realise how much you have done in your life and your life was not in vain. It can provide you with happiness and satisfaction for the rest of your life.""",
        """The benefits of travel are not just a one-time thing: travel changes you physically and psychologically. Having little time or money isn't a valid excuse. You can travel for cheap very easily. If you have a full-time job and a family, you can still travel on the weekends or holidays, even with a baby. travel  more is likely to have a tremendous impact on your mental well-being, especially if you're no used to going out of your comfort zone. Trust me: travel more and your doctor will be happy. Be sure to get in touch with your physician, they might recommend some medication to accompany you in your travels, especially if you're heading to regions of the globe with potentially dangerous diseases.""",
        """Sure, you probably feel comfortable where you are, but that is just a fraction of the world! If you are a student, take advantage of programs such as Erasmus to get to know more people, experience and understand their culture. Dare traveling to regions you have a skeptical opinion about. I bet that you will change your mind and realize that everything is not so bad abroad.""",
        """ So, travel makes you cherish life. Let's travel more . Share your travel diaries with us too"""
        ]

Desired Output :
```
Topic 0: 
learn new life travelling country feel  
Topic 1: 
life cherish diaries let share experience  
Topic 2: 
feel know time people just regions  
Topic 3: 
time especially cherish diaries let share  
..(truncated)..
```

In [None]:
# Importing the Tf-idf vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Defining the vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features= 1000,  max_df = 0.5, smooth_idf=True)

# Transforming the tokens into the matrix form through .fit_transform()
matrix= vectorizer.fit_transform(texts)

# SVD represents documents and terms as vectors
from sklearn.decomposition import TruncatedSVD
SVD_model = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=100, random_state=122)
SVD_model.fit(matrix)

# Getting the terms 
terms = vectorizer.get_feature_names_out()

# Iterating through each topic
for i, comp in enumerate(SVD_model.components_):
    terms_comp = zip(terms, comp)
    # sorting the 7 most important terms
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:7]
    print("Topic "+str(i)+": ")
    # printing the terms of a topic
    for t in sorted_terms:
        print(t[0],end=' ')
    print(' ')