Commit 907a0ac

NLP tutorial: stop words: exercise (#11)
1 parent ae536d2 commit 907a0ac

2 files changed

+504
-0
lines changed
@@ -0,0 +1,234 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "b17f58fa",
6+
"metadata": {
7+
"id": "b17f58fa"
8+
},
9+
"source": [
10+
"### **Stop Words: Exercise**"
11+
]
12+
},
13+
{
14+
"cell_type": "markdown",
15+
"id": "23a26def",
16+
"metadata": {
17+
"id": "23a26def"
18+
},
19+
"source": [
20+
"- **Run this cell to import all necessary packages**"
21+
]
22+
},
23+
{
24+
"cell_type": "code",
25+
"execution_count": 1,
26+
"id": "34f02550",
27+
"metadata": {
28+
"id": "34f02550"
29+
},
30+
"outputs": [],
31+
"source": [
32+
"#import spacy and load the model\n",
33+
"\n",
34+
"import spacy\n",
35+
"nlp = spacy.load(\"en_core_web_sm\")"
36+
]
37+
},
38+
{
39+
"cell_type": "markdown",
40+
"id": "0fe230d8",
41+
"metadata": {
42+
"id": "0fe230d8"
43+
},
44+
"source": [
45+
"**Exercise 1:** \n",
46+
"- Count the number of stop words in the given text.\n",
47+
"- Print the percentage of tokens in the text that are stop words."
48+
]
49+
},
50+
{
51+
"cell_type": "code",
52+
"execution_count": 1,
53+
"id": "646c9e7a",
54+
"metadata": {
55+
"colab": {
56+
"base_uri": "https://localhost:8080/"
57+
},
58+
"id": "646c9e7a",
59+
"outputId": "7d59339e-bf53-4239-eda5-134e6af42e22"
60+
},
61+
"outputs": [],
62+
"source": [
63+
"text = '''\n",
64+
"Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and \n",
65+
"distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).\n",
66+
"The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,\n",
67+
"Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),\n",
68+
"Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.\n",
69+
"'''\n",
70+
"\n",
71+
"#step1: Create the object 'doc' for the given text using nlp()\n",
72+
"\n",
73+
"\n",
74+
"\n",
75+
"#step2: define the variables to keep track of stopwords count and total words count\n",
76+
"\n",
77+
"\n",
78+
"\n",
79+
"#step3: iterate through all the words in the document\n",
80+
"\n",
81+
"\n",
82+
"\n",
83+
"#step4: print the count of stop words\n",
84+
"\n",
85+
" \n",
86+
"\n",
87+
"#step5: print the percentage of stop words compared to total words in the text\n"
88+
]
89+
},
90+
{
91+
"cell_type": "markdown",
92+
"id": "vsJaC5a-ldY-",
93+
"metadata": {
94+
"id": "vsJaC5a-ldY-"
95+
},
96+
"source": [
97+
"**Exercise 2:** \n",
98+
"\n",
99+
"- spaCy's default stop word list includes **\"not\"**. In some scenarios, however, removing \"not\" completely changes the meaning of the text. For example, consider these two statements:\n",
100+
"\n",
101+
" - this is a good movie ----> Positive Statement\n",
102+
" - this is not a good movie ----> Negative Statement\n",
103+
"\n",
104+
"- After stop word removal, both texts reduce to **\"good movie\"**, so the polarity (sentiment) of the original text is lost.\n",
105+
" \n",
106+
"- Your task is to remove **\"not\"** from spaCy's stop words so that the two texts remain distinguishable.\n",
107+
"\n",
108+
"\n",
109+
"- **Hint:** GOOGLE IT! Google is your friend.\n",
110+
"\n"
111+
]
112+
},
113+
{
114+
"cell_type": "code",
115+
"execution_count": 2,
116+
"id": "4e9e663a",
117+
"metadata": {
118+
"colab": {
119+
"base_uri": "https://localhost:8080/"
120+
},
121+
"id": "4e9e663a",
122+
"outputId": "72779ead-6cb9-4f92-da54-3e3a882c2069"
123+
},
124+
"outputs": [],
125+
"source": [
126+
"#use this pre-processing function to remove all stop words from a text and return the cleaned form\n",
127+
"def preprocess(text):\n",
128+
"    doc = nlp(text)\n",
128+
"    no_stop_words = [token.text for token in doc if not token.is_stop]\n",
129+
"    return \" \".join(no_stop_words)\n",
131+
"\n",
132+
"\n",
133+
"#step1: remove the stop word 'not' in spaCy\n",
134+
"\n",
135+
"\n",
136+
"\n",
137+
"#step2: send the two texts given above into the pre-process function and store the transformed texts\n",
138+
"\n",
139+
"\n",
140+
"\n",
141+
"#step3: finally print those 2 transformed texts\n"
142+
]
143+
},
144+
{
145+
"cell_type": "markdown",
146+
"id": "RWnHxZy-Fv5S",
147+
"metadata": {
148+
"id": "RWnHxZy-Fv5S"
149+
},
150+
"source": [
151+
"**Exercise 3:** \n",
152+
"\n",
153+
"- From the given text, output the **most frequently** used token after removing all stop word tokens and punctuation.\n",
154+
"\n"
155+
]
156+
},
157+
{
158+
"cell_type": "code",
159+
"execution_count": 3,
160+
"id": "GfLMTZmBFlPI",
161+
"metadata": {
162+
"colab": {
163+
"base_uri": "https://localhost:8080/"
164+
},
165+
"id": "GfLMTZmBFlPI",
166+
"outputId": "448095a9-954b-43e9-ad86-da7d48aed72c"
167+
},
168+
"outputs": [],
169+
"source": [
170+
"text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.\n",
171+
"It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,\n",
172+
"One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the \n",
173+
"first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be\n",
174+
"granted test cricket status.\n",
175+
"'''\n",
176+
"\n",
177+
"\n",
178+
"#step1: Create the object 'doc' for the given text using nlp()\n",
179+
"\n",
180+
"\n",
181+
"\n",
182+
"#step2: remove all stop words and punctuation and store the remaining tokens in a new list\n",
183+
"\n",
184+
"\n",
185+
"\n",
186+
"#step3: create a new dictionary and count the frequency of each token by iterating through the stored list\n",
187+
"\n",
188+
"\n",
189+
"\n",
190+
"#step4: get the maximum frequency word\n",
191+
"\n",
192+
"\n",
193+
"\n",
194+
"#step5: finally print the result\n"
195+
]
196+
},
197+
{
198+
"cell_type": "markdown",
199+
"id": "ByUPtcy9EsCn",
200+
"metadata": {
201+
"id": "ByUPtcy9EsCn"
202+
},
203+
"source": [
204+
"## [**Solution**](./stop_words_exercise_solutions.ipynb)"
205+
]
206+
}
207+
],
208+
"metadata": {
209+
"colab": {
210+
"collapsed_sections": [],
211+
"name": "stop_words_exercise_solutions.ipynb",
212+
"provenance": []
213+
},
214+
"kernelspec": {
215+
"display_name": "Python 3 (ipykernel)",
216+
"language": "python",
217+
"name": "python3"
218+
},
219+
"language_info": {
220+
"codemirror_mode": {
221+
"name": "ipython",
222+
"version": 3
223+
},
224+
"file_extension": ".py",
225+
"mimetype": "text/x-python",
226+
"name": "python",
227+
"nbconvert_exporter": "python",
228+
"pygments_lexer": "ipython3",
229+
"version": "3.8.10"
230+
}
231+
},
232+
"nbformat": 4,
233+
"nbformat_minor": 5
234+
}
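
The polarity problem described in Exercise 2 can be reproduced without spaCy. The following is a minimal sketch, not the notebook's solution: `STOP_WORDS` is a hypothetical four-word list (spaCy's real list is much larger), tokens are split on whitespace instead of being produced by `nlp()`, and this `preprocess` takes the stop word set as an extra argument for illustration.

```python
# Hypothetical miniature stop word list, assumed for illustration only.
STOP_WORDS = {"this", "is", "a", "not"}


def preprocess(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)


texts = ["this is a good movie", "this is not a good movie"]

# With "not" treated as a stop word, both sentences collapse to the same string.
with_not = [preprocess(t, STOP_WORDS) for t in texts]

# Taking "not" out of the set keeps the negation, so the texts stay distinct.
without_not = [preprocess(t, STOP_WORDS - {"not"}) for t in texts]

print(with_not)
print(without_not)
```

The same idea carries over to the notebook: once "not" is no longer flagged as a stop word, the two preprocessed texts differ and a downstream sentiment step can tell them apart.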
