# Using Jupyter Notebooks



In [None]:
!pip install nltk spacy gensim pandas scikit-learn
#pip: A package manager for Python that allows you to install and manage software packages written in Python.
#install: A command used with pip to download and install the specified packages from the Python Package Index (PyPI) or other repositories.
#nltk: Stands for Natural Language Toolkit, a library used for working with human language data (text). It provides easy-to-use interfaces and resources for tasks like tokenization, parsing, and classification.
#spacy: An open-source library for advanced Natural Language Processing (NLP) in Python. It is designed for performance and includes features like tokenization, part-of-speech tagging, and named entity recognition.
#gensim: A Python library for topic modeling and document similarity analysis, primarily used for processing and analyzing large text corpora.
#pandas: A powerful data manipulation and analysis library for Python, providing data structures like DataFrames for handling structured data.
#scikit-learn: A machine learning library for Python that provides simple and efficient tools for data mining and data analysis, including classification, regression, clustering, and dimensionality reduction.



## Introduction
Jupyter Notebook is an open-source, web-based editor for Python that runs in your browser. It lets you combine text written in an html-like format known as "markdown", such as the block of text you're reading right now, with inline code, tools and outputs such as this one:

In [None]:
import nltk
nltk.download('punkt') # Download the Punkt tokenizer
from nltk.tokenize import word_tokenize

sentence = "Hello, world! This is NLP."
tokens = word_tokenize(sentence)
print(tokens)
#import: A statement used to include modules or specific functions from a module into your Python program, allowing you to use the features defined in those modules.
#from: A keyword used in conjunction with import to specify the exact module or part of a module you want to import. Here it imports word_tokenize from the nltk.tokenize submodule.
#tokenize: A submodule of NLTK (nltk.tokenize) that contains functions for breaking text into tokens.
#sentence: A variable that stores a string containing a sentence. In this case, it is "Hello, world! This is NLP."
#tokens: A variable that stores the result of the tokenization process, which is the output of the word_tokenize function applied to the sentence.
#word_tokenize: A function from the nltk.tokenize module that takes a string as input and splits it into individual tokens (words and punctuation).
#print: A built-in Python function that outputs the specified message or variable value to the console.
#Output: A comment that describes the expected output of the preceding line of code. In this case, it shows what the tokens variable will contain after tokenization.
#['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']: A list of tokens resulting from the tokenization of the input sentence. Each word and punctuation mark is treated as a separate token.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
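
To build intuition for what a word tokenizer does, here is a minimal sketch that uses only Python's standard `re` module. It is a simplified stand-in, not the algorithm NLTK actually uses; the function name `simple_tokenize` and the regular expression are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores.
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Hello, world! This is NLP."
tokens_sketch = simple_tokenize(sentence)
print(tokens_sketch)
# ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
```

For this sentence the sketch gives the same eight tokens as `word_tokenize`, but the two will differ on harder input such as contractions, which is one reason to prefer the library function in practice.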


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
#corpus: A large collection of texts used for linguistic research. Here, nltk.corpus is the NLTK submodule that gives access to such collections and word lists.
#stopwords: A resource in nltk.corpus that provides lists of common words (like "the," "is," "in") that are usually filtered out in text processing because they carry less meaning.
#download: A method from the NLTK library that retrieves and installs specified resources (like stopword lists) for use in your NLP tasks.
#stop_words: A variable that stores a set of English stopwords, which are words that are typically ignored in natural language processing tasks.
#set: A built-in Python data type that represents an unordered collection of unique elements. Here, it’s used to store stopwords for efficient membership testing.
#filtered_tokens: A variable that stores a list of tokens after filtering out the stopwords.
#tokens: A list variable (from the previous snippet) that contains the tokenized words and punctuation from the original sentence.
#word: A variable used in the list comprehension to represent each individual token as it is processed.
#word.lower(): A method that converts a string to lowercase, allowing for case-insensitive comparison.
#not in: A membership operator that checks if a specified element is not present in a collection (in this case, checking if the token is not a stopword).
#print: A built-in Python function that outputs the specified message or variable value to the console.

['Hello', ',', 'world', '!', 'NLP', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(ps.stem("faster")) # Output: faster
print(lemmatizer.lemmatize("faster")) # Output: faster (without a part-of-speech hint the word is treated as a noun)
#stem: The method of PorterStemmer that returns the stem of a word. It is also the name of the nltk.stem submodule from which the class is imported.
#PorterStemmer: A class from the NLTK library that implements the Porter stemming algorithm, which reduces words to their base or root form (e.g., "running" becomes "run").
#nltk.stem: A submodule in the NLTK library that provides classes and functions for stemming and lemmatization.
#WordNetLemmatizer: A class in the NLTK library that uses the WordNet lexical database to perform lemmatization, which is the process of reducing a word to its dictionary form based on its meaning (e.g., "better" becomes "good" when the word is treated as an adjective).
#nltk.download('wordnet'): A method that downloads the WordNet lexical database, which is required for the WordNetLemmatizer to function properly.
#ps: A variable that is an instance of the PorterStemmer class, allowing access to its stemming methods.
#lemmatizer: A variable that is an instance of the WordNetLemmatizer class, allowing access to its lemmatization methods.
#print: A built-in Python function that outputs the specified message or variable value to the console.
#ps.stem("faster"): A method call that applies the stemming process to the word "faster" and returns its stemmed form. As the output below shows, the Porter stemmer leaves this word unchanged.


faster
faster


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
#pandas as pd: An import statement that includes the pandas library and gives it the alias pd. Pandas is used for data manipulation and analysis.
#nltk: The Natural Language Toolkit, a library for working with human language data, though it's included here without any specific functions being imported.
#from: A keyword used in conjunction with import to specify the exact module or part of a module you want to import.
#sklearn.model_selection: A submodule of scikit-learn that provides functions for splitting datasets into training and testing sets.
#train_test_split: A function from the model_selection submodule that splits arrays or matrices into random train and test subsets, useful for evaluating machine learning models.
#sklearn.feature_extraction.text: A submodule of scikit-learn that includes tools for converting text data into numerical features.
#CountVectorizer: A class from the feature_extraction.text submodule that converts a collection of text documents to a matrix of token counts, representing how many times each word appears.
#sklearn.naive_bayes: A submodule of scikit-learn that implements Naive Bayes classifiers for various types of data.
#MultinomialNB: A class from the naive_bayes submodule that implements the Multinomial Naive Bayes algorithm, commonly used for classification tasks with discrete features (like word counts).
#sklearn: Short for scikit-learn, a machine learning library in Python that provides simple and efficient tools for data mining and data analysis.
#metrics: A submodule of scikit-learn that provides functions for evaluating the performance of machine learning models, such as accuracy, precision, recall, and F1-score.


In [None]:
data = {
 'text': [
 'I love this movie!',
 'This was a terrible movie.',
 'I really enjoyed the film.',
 'Worst experience ever.',
 'It was fantastic!',
 'Not worth the time.',
 'Absolutely amazing!',
 'It was okay, not great.',
 'I hate this film.',
 'Best movie ever!'
 ],
 'sentiment': [
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'neutral',
 'negative',
 'positive'
 ]
}

In [None]:
df = pd.DataFrame(data)
#df: A variable that typically stands for "DataFrame," which is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure in pandas.
#pd: The alias for the pandas library, allowing access to its functions and classes. It was defined in the earlier import statement (import pandas as pd).
#DataFrame: A class in the pandas library that represents a table of data, similar to a spreadsheet or SQL table. It allows for storing data in rows and columns with labeled axes.
#data: The dictionary defined in the previous cell, which holds the data to be converted into a DataFrame. In general this could take various forms, such as a dictionary, a list of lists, or a NumPy array.

In [None]:
print(df)

                         text sentiment
0          I love this movie!  negative
1  This was a terrible movie.  positive
2  I really enjoyed the film.  negative
3      Worst experience ever.  positive
4           It was fantastic!  negative
5         Not worth the time.  positive
6         Absolutely amazing!  negative
7     It was okay, not great.   neutral
8           I hate this film.  negative
9            Best movie ever!  positive


In [None]:
X = df['text']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Vectorize the text
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
#X: A variable that typically represents the feature set (input data) in a machine learning context. In this case, it is set to the 'text' column of the DataFrame df.
#y: A variable that typically represents the target variable (output data) in a machine learning context. Here, it is set to the 'sentiment' column of the DataFrame df.
#X_train, X_test, y_train, y_test: Variables that hold the training and testing datasets after splitting. X_train and y_train are for training, while X_test and y_test are for testing the model.
#train_test_split: A function that splits the dataset into training and testing subsets. It takes the input features (X), the target variable (y), the proportion of the data to be used for testing (test_size), and a random state for reproducibility.
#test_size=0.2: A parameter in the train_test_split function that specifies that 20% of the data should be allocated to the test set, while the remaining 80% will be used for training.
#random_state=42: A parameter that sets the seed for the random number generator, ensuring that the split is reproducible across different runs.
#vectorizer: A variable that holds an instance of the CountVectorizer class, which is used to convert text data into a numerical format.
#CountVectorizer(): A class from scikit-learn that converts a collection of text documents to a matrix of token counts.
#X_train_vectorized: A variable that stores the vectorized representation of the training text data (X_train), created by fitting and transforming the CountVectorizer.
#X_test_vectorized: A variable that stores the vectorized representation of the testing text data (X_test), created by transforming the data using the previously fitted vectorizer.
#fit_transform: A method of the CountVectorizer that learns the vocabulary from the training data and transforms it into a document-term matrix.
#transform: A method of the CountVectorizer that transforms new data into the document-term matrix using the vocabulary learned from the training data.
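
To make the difference between fitting and transforming concrete, the following is a minimal bag-of-words sketch in plain Python. It is not how `CountVectorizer` is implemented: it splits on whitespace only, and the helper names `fit_vocabulary` and `transform_counts` as well as the toy documents are assumptions made for illustration.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the training documents."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform_counts(documents, vocabulary):
    """Turn each document into a list of counts, one per vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Words that were not seen during fitting have no column and are ignored.
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["good movie", "bad movie"]
test_docs = ["good good film"]

vocabulary = fit_vocabulary(train_docs)                # learned from training data only
train_matrix = transform_counts(train_docs, vocabulary)
test_matrix = transform_counts(test_docs, vocabulary)  # reuses the same columns

print(vocabulary)    # {'bad': 0, 'good': 1, 'movie': 2}
print(train_matrix)  # [[0, 1, 1], [1, 0, 1]]
print(test_matrix)   # [[0, 2, 0]]
```

The word "film" never appeared in the training documents, so it contributes nothing to the test row. The same thing happens when a fitted vectorizer meets a new word at prediction time.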


In [None]:
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
#model: A variable that holds an instance of a machine learning model. In this case, it represents the Naive Bayes classifier.
#MultinomialNB(): A class from the naive_bayes module in scikit-learn that implements the Multinomial Naive Bayes algorithm, commonly used for classification tasks involving discrete features, like word counts.
#model.fit(): A method that trains the model using the training data. It takes the feature set (X_train_vectorized) and the target variable (y_train) as inputs.
#X_train_vectorized: The matrix of token counts for the training data, which was created using the CountVectorizer earlier. This serves as the input for training the model.
#y_train: The target variable for the training set, representing the sentiment labels corresponding to the training data.
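
For readers curious about what training a Multinomial Naive Bayes model involves, here is a small sketch in plain Python, assuming whitespace tokenization and add-one smoothing. It illustrates the idea rather than reproducing scikit-learn's implementation; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count how often each label and each (label, word) pair occurs."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in zip(documents, labels):
        words = doc.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict_label(doc, label_counts, word_counts, vocabulary):
    """Return the label with the highest log-probability score."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior: how common this label is in the training data.
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are skipped
            # Add-one smoothing avoids log(0) for words unseen in this class.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_docs = ["good great film", "great acting", "bad boring film", "boring plot"]
toy_labels = ["positive", "positive", "negative", "negative"]
trained = train_naive_bayes(toy_docs, toy_labels)

print(predict_label("great film", *trained))   # positive
print(predict_label("boring film", *trained))  # negative
```

Working with sums of logarithms rather than products of probabilities keeps the scores numerically stable when documents contain many words.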


In [None]:
y_pred = model.predict(X_test_vectorized)
#y_pred: A variable that stores the predicted values generated by the model for the test dataset. It represents the model's output (predicted sentiments) based on the input features.
#model.predict(): A method used to make predictions based on the input features. It takes the vectorized test data (X_test_vectorized) as input and returns the predicted labels.
#X_test_vectorized: The matrix of token counts for the test data, created using the CountVectorizer. This serves as the input for making predictions with the trained model.

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred)
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(confusion_matrix)
#accuracy: A variable that stores the accuracy of the model, which is a measure of how often the model's predictions match the actual labels in the test dataset.
#metrics: A submodule of scikit-learn that provides functions for evaluating the performance of machine learning models.
#accuracy_score(): A function from the metrics module that calculates the accuracy of predictions by comparing the true labels (y_test) with the predicted labels (y_pred).
#confusion_matrix: A variable that stores the confusion matrix, which is a table used to evaluate the performance of a classification model. It shows the number of true positive, true negative, false positive, and false negative predictions.
#confusion_matrix(): A function from the metrics module that computes the confusion matrix based on the true labels (y_test) and predicted labels (y_pred).
#print(): A built-in Python function used to output messages or variable values to the console.
#f'Accuracy: {accuracy:.2f}': An f-string (formatted string) that formats the accuracy value to two decimal places for display in the output.
#'Confusion Matrix:': A string that serves as a label for the confusion matrix output.


Accuracy: 0.00
Confusion Matrix:
[[0 2]
 [0 0]]
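
To see how these two numbers are derived, the sketch below computes an accuracy and a confusion matrix by hand in plain Python. The function names and the example labels are made up for illustration. It assumes the same layout that scikit-learn uses: rows are true labels, columns are predicted labels, and labels are in sorted order.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion(true_labels, predicted_labels):
    """Build a matrix with one row per true label and one column per predicted label."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

true_example = ["negative", "negative", "positive", "positive"]
pred_example = ["negative", "positive", "positive", "positive"]

print(f"Accuracy: {accuracy(true_example, pred_example):.2f}")  # Accuracy: 0.75
print(confusion(true_example, pred_example))
# (['negative', 'positive'], [[1, 1], [0, 2]])
```

Read this way, the matrix printed by the notebook says that both test examples had the same true label (first row) and both were predicted as the other label (second column), which is why the accuracy is 0.00. With only two test examples, such a score is a very rough estimate.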


In [None]:
def predict_sentiment(text):
    text_vectorized = vectorizer.transform([text])
    prediction = model.predict(text_vectorized)
    return prediction[0]
# Example usage
new_text = "I loved the plot and the acting!"
print(f'Sentiment: {predict_sentiment(new_text)}')
#def: A keyword used to define a new function in Python.
#predict_sentiment: The name of the function being defined. It takes a single argument, text, which represents the input text whose sentiment needs to be predicted.
#text: A parameter of the function that represents the input string for which sentiment prediction is to be made.
#text_vectorized: A variable that stores the vectorized representation of the input text. This is done by transforming the text using the previously fitted CountVectorizer.
#vectorizer.transform(): A method that transforms the input text into a document-term matrix using the vocabulary learned from the training data.
#model.predict(): A method that predicts the sentiment based on the vectorized input text. It returns the predicted sentiment label.
#prediction: A variable that stores the result of the prediction, which is typically an array of predicted sentiment labels.
#return: A statement that exits the function and sends back the specified value to the caller.
#prediction[0]: Accesses the first element of the prediction array, which represents the predicted sentiment label.
#new_text: A variable that stores a new input string, in this case, "I loved the plot and the acting!", for which the sentiment will be predicted.
#print(): A built-in Python function used to output messages or variable values to the console.
#f'Sentiment: {predict_sentiment(new_text)}': An f-string that calls the predict_sentiment function with new_text as an argument and formats the output string to display the predicted sentiment.


Sentiment: negative


In [None]:
print("Hello World")

This combination allows for the production of beautiful documents containing software, documentation and discussion. For larger programs you may wish to use Python in a stand-alone environment such as a traditional IDE. But for demonstration purposes Jupyter is a very useful tool.

Notebook files have the ".ipynb" extension. A Jupyter notebook is one of many environments in which you may run Python code. Colab and the Jupyter notebook editor in Anaconda are two of the many pieces of software you may use to write and run a Jupyter notebook. For this course we recommend using the online Google Colab tool, but you can use Anaconda to run the notebooks on your own machine without an internet connection. On college computers, Jupyter can be used by launching Anaconda from the Software Hub Apps Anywhere interface.

Note that exact interfaces will differ between different environments but the same functionality should be found in most environments. This course will be using the Colab environment.

## Cells and Executing Code

A notebook is made up of one or more "cells". Cells can contain either the html-like text used to generate formatted text or code to be run by the user. A cell containing a piece of code may be recognised by the ```[ ]``` to the left of it. Code in these cells can be run in a number of ways. The simplest is to click on the ```[ ]```. This will execute the code. Try this with the code snippet below:

In [None]:
print("Yes, it worked!")

You should have seen the message "Yes, it worked!" appear immediately beneath the code. This is the output of the code, which has been printed to the screen. You may also have noticed a number appear between the square brackets to the left of the code snippet. This indicates the order in which the code snippet has been executed. Code cells may be executed in any order, and variables are saved between executions of code snippets. To try this, execute the three code snippets below in the following order:
- 1
- 2
- 3
- 2

In [None]:
a="Message 1"

In [None]:
print(a)

In [None]:
a="Message 2"

The first time you ran code snippet 2 you should have seen "Message 1" as the output, and the second time the output should have been "Message 2". This is because the first time it was run, the value assigned to the variable named "a" was "Message 1", as set by the first code snippet, and the second time it was "Message 2", as set by the third code snippet. Note also the numbers now shown within the square brackets. These help you to know which cells have been executed and in which order.
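
The behaviour described above can be imitated in plain Python. The sketch below is only an analogy, not how Jupyter is implemented: a single dictionary stands in for the notebook's memory, and each string stands in for one code cell. The names `namespace` and `cells` are made up for this illustration.

```python
# One shared dictionary plays the role of the notebook's memory.
namespace = {"output": []}  # "output" collects what cell 2 would print

# Each string stands in for the contents of one code cell.
cells = {
    1: 'a = "Message 1"',
    2: 'output.append(a)',
    3: 'a = "Message 2"',
}

# Run the cells in the order suggested above: 1, 2, 3, 2.
for cell_number in [1, 2, 3, 2]:
    exec(cells[cell_number], namespace)

print(namespace["output"])  # ['Message 1', 'Message 2']
```

Because every cell reads from and writes to the same memory, the result of running a cell depends on which cells were run before it, not on where the cell sits in the notebook.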

## Sharing Jupyter Notebooks on Colab
When a Jupyter Notebook is shared with you on Colab, you will often receive access that allows you to run the code but not edit it. This should be the case for the notebooks that form part of this course. In this case you can select "Save a Copy in Drive" from the "File" menu to create a new copy that is yours and that you can edit.

For this course, it is recommended that you create two copies. One of these should be the original copy without your edits, and the other a copy which you can edit to complete exercises or experiment.

## Basic Jupyter Commands

Jupyter contains a number of useful tools for executing these cells. By using the "Runtime" menu, you can run multiple cells at a time using "Run all", "Run before", "Run selected" and "Run after".

You can clear output (this is the term for what is written under a code cell when it's executed) by clicking on the symbol to the left of it. You can clear all outputs from the notebook using the "Clear All Outputs" command on the "Edit" menu. Clearing the output will not unset variables set by the code snippets run, only remove the output printed to the screen.

To unset variables, use the "Restart Runtime" or "Reset Runtime" option in the "Runtime" menu. The "Interrupt Execution" command, found on the "Runtime" menu in Colab (the "Kernel" menu in some other Jupyter environments), will halt the processing of code, which can be useful if you've accidentally written a piece of code that will never finish executing or if the code is taking too long to execute.

The "insert" menu allows you to create new cells. The "cell type" option in the "cell" menu allows you to toggle the current cell between the different cell types available:
- **Code**: Code snippets
- **Text**: The html-like language used to generate text, tables, equations, etc.

Alternatively, you can hover your mouse in the space after a cell and add a code or text cell there.

### Exercise

Try each of these commands from the different menus for yourself on this notebook and ensure they behave as you would expect.

## Text Cells in Jupyter
You can include all sorts of information in Jupyter text cells to obtain different effects. To see how each of the following examples is generated, double click on this cell. To return to the formatted text, run the cell.

### Headings
Headings can be generated using the hash symbol "#". The more of these there are, the smaller the heading. The sub-sub-heading above is an example.

### Tables
Tables can be created in a way similar to basic html, using a combination of the "|" and "-" symbols:

| This | is    |
|------|-------|
|   a  |  table|
| It's | fancy |

### Equations
Equations can be written in a way similar to LaTeX by surrounding the text with "\$" symbols:

$a=\frac{\int\limits_{0}^{\pi} \sin{(bx)} \textrm{d}x}{4}$

Don't worry if you don't understand the exact syntax used to generate this example. When you try this for yourself in the exercise, write something very simple instead. If it looks like a simple algebraic expression, it will probably render as you intend.
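
As a gentler starting point, a simple expression needs little more than letters and operators between the dollar signs. The two expressions below are chosen purely for illustration. The second uses `\frac`, the same command that appears in the example above, to draw a fraction:

$y = mx + c$

$z = \frac{x}{y}$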

### Code Snippets
You can write snippets of code in a text cell and they will be highlighted as if they were code written in a code cell. This can be useful for demonstrating a code feature in a textual way. For example:

```python
print("Hello World")
```

This code cannot be run; it is merely normal text highlighted to look like code. The "python" which precedes the code itself tells Jupyter which language the snippet is written in so that it can be highlighted accordingly.

In some environments, text cells may also be referred to as "markdown" cells.

### Exercise
Try creating simple versions of each of the constructs above in a new text cell below this one.