codebasics
diff --git a/10_stop_words/stop_words_exercise.ipynb b/10_stop_words/stop_words_exercise.ipynb
@@ -0,0 +1,234 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b17f58fa",
+   "metadata": {
+    "id": "b17f58fa"
+   },
+   "source": [
+    "###                     **Stop Words: Exercise**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23a26def",
+   "metadata": {
+    "id": "23a26def"
+   },
+   "source": [
+    "- **Run this cell to import all necessary packages**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "34f02550",
+   "metadata": {
+    "id": "34f02550"
+   },
+   "outputs": [],
+   "source": [
+    "#import spacy and load the model\n",
+    "\n",
+    "import spacy\n",
+    "nlp = spacy.load(\"en_core_web_sm\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0fe230d8",
+   "metadata": {
+    "id": "0fe230d8"
+   },
+   "source": [
+    "**Exercise1:** \n",
+    "- From a given text, count the number of stop words in it.\n",
+    "- Print the percentage of stop word tokens compared to all tokens in a given text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "646c9e7a",
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "646c9e7a",
+    "outputId": "7d59339e-bf53-4239-eda5-134e6af42e22"
+   },
+   "outputs": [],
+   "source": [
+    "text = '''\n",
+    "Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and \n",
+    "distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).\n",
+    "The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,\n",
+    "Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),\n",
+    "Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.\n",
+    "'''\n",
+    "\n",
+    "#step1: Create the object 'doc' for the given text using nlp()\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step2: define variables to keep track of the stop-word count and the total token count\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step3: iterate through all the tokens in the document\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step4: print the count of stop words\n",
+    "\n",
+    "    \n",
+    "\n",
+    "#step5: print the percentage of stop words compared to total words in the text\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "vsJaC5a-ldY-",
+   "metadata": {
+    "id": "vsJaC5a-ldY-"
+   },
+   "source": [
+    "**Exercise2:** \n",
+    "\n",
+    "- spaCy's default stop-word list includes **\"not\"**. In some scenarios, however, removing 'not' completely changes the meaning of the text. For example, consider these two statements:\n",
+    "\n",
+    "      - this is a good movie       ----> Positive Statement\n",
+    "      - this is not a good movie   ----> Negative Statement\n",
+    "\n",
+    "- After stop-word removal, both texts become **\"good movie\"**, so the polarity/sentiment of the original text is lost.\n",
+    "  \n",
+    "- Your task is to remove **\"not\"** from spaCy's stop words so that the two texts can be distinguished after preprocessing.\n",
+    "\n",
+    "\n",
+    "- **Hint:** GOOGLE IT! Google is your friend.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "4e9e663a",
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "4e9e663a",
+    "outputId": "72779ead-6cb9-4f92-da54-3e3a882c2069"
+   },
+   "outputs": [],
+   "source": [
+    "#use this preprocessing function to remove all stop words from a text and return the cleaned form\n",
+    "def preprocess(text):\n",
+    "    doc = nlp(text)\n",
+    "    no_stop_words = [token.text for token in doc if not token.is_stop]\n",
+    "    return \" \".join(no_stop_words)       \n",
+    "\n",
+    "\n",
+    "#step1: remove 'not' from spaCy's stop words\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step2: pass the two texts given above to the preprocess function and store the transformed texts\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step3: finally print those 2 transformed texts\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "RWnHxZy-Fv5S",
+   "metadata": {
+    "id": "RWnHxZy-Fv5S"
+   },
+   "source": [
+    "**Exercise3:** \n",
+    "\n",
+    "- From a given text, output the **most frequently** used token after removing all stop words and punctuation.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "GfLMTZmBFlPI",
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "GfLMTZmBFlPI",
+    "outputId": "448095a9-954b-43e9-ad86-da7d48aed72c"
+   },
+   "outputs": [],
+   "source": [
+    "text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.\n",
+    "It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,\n",
+    "One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the \n",
+    "first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be\n",
+    "granted test cricket status.\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "#step1: Create the object 'doc' for the given text using nlp()\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step2: remove all the stop words and punctuation and store the remaining tokens in a new list\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step3: create a dictionary and count how often each token occurs by iterating through the list of stored tokens\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step4: get the token with the maximum frequency\n",
+    "\n",
+    "\n",
+    "\n",
+    "#step5: finally print the result\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ByUPtcy9EsCn",
+   "metadata": {
+    "id": "ByUPtcy9EsCn"
+   },
+   "source": [
+    "## [**Solution**](./stop_words_exercise_solutions.ipynb)"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "collapsed_sections": [],
+   "name": "stop_words_exercise.ipynb",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
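
Note on Exercise 2: the effect described there can be reproduced without spaCy. The sketch below is only an illustration of the idea, not the notebook's solution. `TOY_STOP_WORDS` and `preprocess_toy` are hypothetical names, the stop-word set is a small hand-picked assumption rather than spaCy's list, and tokens are split on whitespace instead of using the spaCy tokenizer.

```python
# Plain-Python illustration; it does NOT use spaCy's tokenizer or stop-word list.
# TOY_STOP_WORDS is a hypothetical, hand-picked set used for this sketch only.
TOY_STOP_WORDS = {"this", "is", "a", "not", "the", "and"}


def preprocess_toy(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


positive = "this is a good movie"
negative = "this is not a good movie"

# With "not" treated as a stop word, both sentences collapse to the same string.
before = (preprocess_toy(positive, TOY_STOP_WORDS),
          preprocess_toy(negative, TOY_STOP_WORDS))

# Removing "not" from the set keeps the negation in the second sentence.
custom_stop_words = TOY_STOP_WORDS - {"not"}
after = (preprocess_toy(positive, custom_stop_words),
         preprocess_toy(negative, custom_stop_words))

print(before)  # ('good movie', 'good movie')
print(after)   # ('good movie', 'not good movie')
```

In the notebook itself, the analogous step would be to change spaCy's own stop-word setting for "not" before calling `preprocess`, so that `token.is_stop` no longer filters it out.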