In [1]:
import torch


PyTorch is an open-source machine learning library originally developed by Facebook's AI research group (now Meta AI). It is based on the Torch library and provides a flexible and powerful environment for building and deploying deep learning models.

One of PyTorch's main features is its dynamic computation graph: the graph is built as the code runs, so a model can use ordinary Python control flow and be modified between iterations. Frameworks built around static graphs, such as TensorFlow 1.x, require the graph to be defined before it is executed. The dynamic approach suits research and experimentation, as well as models whose structure depends on the input.
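As a minimal sketch of what a dynamic graph allows, the function below (a hypothetical example, not part of the notebook) uses a Python while loop whose number of iterations depends on the input values. Autograd records only the operations that actually ran and differentiates through them.

```python
import torch

def double_until_large(x, threshold=10.0):
    """Double x until its norm reaches the threshold; the loop count depends on the data."""
    y = x
    steps = 0
    while y.norm() < threshold:  # ordinary Python control flow, evaluated on real values
        y = y * 2
        steps += 1
    return y.sum(), steps

x = torch.tensor([1.0, 2.0], requires_grad=True)
out, steps = double_until_large(x)
out.backward()  # autograd differentiates through exactly the operations that ran

print(steps)   # number of doublings performed for this input
print(x.grad)  # each element's gradient equals 2 ** steps
```

A different input can take a different number of loop iterations, and the gradient changes accordingly without any change to the code.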

PyTorch also includes support for CUDA, which is a technology that allows for the acceleration of computations on NVIDIA GPUs. This makes it possible to train and deploy large, complex models much faster than on a CPU.

PyTorch provides a wide variety of pre-built and pre-trained models for various tasks like computer vision, natural language processing, and recommendation systems. It also provides tools for data loading, visualization, and model evaluation.

PyTorch's main modules are torch and torch.nn. The torch module provides core functionality such as tensor computation; its tensors behave much like NumPy arrays, with added GPU support and automatic differentiation. The torch.nn module provides neural network building blocks such as layers and loss modules. The torch.optim module defines optimizers such as SGD and Adam, and torch.nn.functional provides functional forms of common operations such as loss functions and activation functions.
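The sketch below shows how these four modules might fit together in a single training step. The data is random and the layer sizes and learning rate are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)  # make the random data and initial weights repeatable

# torch: create tensors (toy data: 8 samples with 3 features each)
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

# torch.nn: a layer with learnable parameters
model = nn.Linear(in_features=3, out_features=1)

# torch.optim: an optimizer that updates those parameters
optimizer = optim.SGD(model.parameters(), lr=0.05)

# torch.nn.functional: a stateless loss function
loss_before = F.mse_loss(model(inputs), targets)

optimizer.zero_grad()   # clear any old gradients
loss_before.backward()  # compute gradients of the loss
optimizer.step()        # apply one gradient-descent update

loss_after = F.mse_loss(model(inputs), targets)
print(loss_before.item(), loss_after.item())
```

With a small learning rate like this one, the loss after the update should be lower than the loss before it.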





A tensor is a multi-dimensional array of data. It is the fundamental data structure in PyTorch and other deep learning frameworks. Tensors can be thought of as generalizations of vectors and matrices to higher dimensions.
Tensors are useful for a wide range of tasks, such as image and signal processing, natural language processing, and machine learning. They can be used to represent the data and the parameters of a model, and can also be used as the inputs and outputs of a model.
One of the main advantages of tensors is that they can be placed on a GPU, which allows much faster computation than a CPU alone for large workloads. A PyTorch tensor can be moved between the CPU and GPU with a single call such as .to(device), so training a large model on a GPU requires little device management code.

In summary, tensors are multi-dimensional arrays that serve as a flexible data structure for building and training machine learning models in frameworks like PyTorch. Because they can be processed on a GPU, they are well suited to deep learning tasks.
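A short sketch of tensors with different numbers of dimensions, and of moving a tensor to a GPU when one is present. The shapes are arbitrary examples, and the code falls back to the CPU if CUDA is unavailable.

```python
import torch

scalar = torch.tensor(3.0)              # 0 dimensions
vector = torch.tensor([1.0, 2.0, 3.0])  # 1 dimension
matrix = torch.zeros(2, 3)              # 2 dimensions
batch = torch.zeros(4, 3, 32, 32)       # 4 dimensions, e.g. 4 RGB images of 32x32 pixels

print(scalar.ndim, vector.shape, matrix.shape, batch.shape)

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
matrix_on_device = matrix.to(device)    # .to() returns a tensor on the target device
print(matrix_on_device.device)
```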


In [2]:
import warnings
warnings.filterwarnings('ignore')

# Install and import dependencies

In [3]:
!pip install torch==1.12.1 torchvision==0.13.1 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


ERROR: Could not find a version that satisfies the requirement torchaudio===0.8.1 (from versions: 0.11.0, 0.11.0+cpu, 0.11.0+cu113, 0.11.0+cu115, 0.12.0, 0.12.0+cpu, 0.12.0+cu113, 0.12.0+cu116, 0.12.1, 0.12.1+cpu, 0.12.1+cu113, 0.12.1+cu116, 0.13.0, 0.13.0+cpu, 0.13.0+cu116, 0.13.0+cu117, 0.13.1, 0.13.1+cpu, 0.13.1+cu116, 0.13.1+cu117)
ERROR: No matching distribution found for torchaudio===0.8.1


This command uses pip, the package manager for Python, to install three packages: torch, torchvision, and torchaudio. These packages are part of the PyTorch ecosystem.

The torch package is the main PyTorch library, which provides functions for building and training neural networks. The torchvision package is a library for computer vision tasks that builds on torch. It provides datasets and models for image and video tasks such as object detection, semantic segmentation, and image classification. The torchaudio package is a library for audio tasks that also builds on torch. It provides datasets and models for speech recognition, sound classification, and audio processing.

Each package is pinned to a specific version. The == operator tells pip to install exactly that version, and the === operator used for torchaudio requests an exact string match. The versions specified in this command are torch==1.12.1, torchvision==0.13.1, and torchaudio===0.8.1.

The torchaudio pin is the reason the command fails. As the error output shows, the torchaudio versions pip could find range from 0.11.0 to 0.13.1, so 0.8.1 is not available. The torchaudio release that pairs with torch 1.12.1 is 0.12.1, so changing the pin to torchaudio==0.12.1 should resolve the error.

The suffixes in the version list, such as +cpu, +cu113, and +cu116, identify builds for a particular target: +cpu is a CPU-only build, and +cu116 is built with support for CUDA 11.6. CUDA is a parallel computing platform and programming model developed by NVIDIA that allows the use of GPUs for accelerating computations. This command specifies no suffix, so pip selects one of the matching builds.

The -f flag (short for --find-links) points pip at a page of links to package files, here https://download.pytorch.org/whl/torch_stable.html. pip searches that page for matching packages in addition to the default PyPI index.

Overall, this command attempts to install pinned versions of torch, torchvision, and torchaudio, using the PyTorch download page as an extra source of builds. Pinning versions helps keep an environment reproducible, provided the pinned versions are compatible with each other.

In [4]:
!pip install transformers request beatifulsoup4 pandas numpy



ERROR: Could not find a version that satisfies the requirement request (from versions: none)
ERROR: No matching distribution found for request


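This command is meant to install several libraries that are commonly used in natural language processing and machine learning tasks. Two of the package names are misspelled: request should be requests, and beatifulsoup4 should be beautifulsoup4. The first typo causes the error shown above. Here is a brief explanation of each library, using the correct names: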

transformers: This is a library developed by Hugging Face that provides pre-trained state-of-the-art models for natural language processing tasks such as text classification, question answering, and translation. These models are based on the transformer architecture, which has proven highly effective in NLP tasks.

requests: This library makes it easy to send HTTP requests. It hides the complexity of making requests behind a simple API so that you can focus on interacting with services.

beautifulsoup4: This is a library for pulling data out of HTML and XML files. It creates parse trees from page source code that can be used to extract data from HTML, which is useful for web scraping.

pandas: This library provides easy-to-use data structures and data analysis tools for Python. It is particularly useful for working with tabular data, such as data from a CSV file or a SQL database.

numpy: This library provides array data structures and a wide range of numerical operations for Python. It is a dependency of many other libraries in the scientific Python ecosystem and is often used for arrays, matrices, and mathematical operations on them.

Together, these libraries provide powerful tools for performing NLP and machine learning tasks, such as text preprocessing, feature extraction, and model training and evaluation.

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re

AutoTokenizer and AutoModelForSequenceClassification from transformers: These are classes from the transformers library that provide an easy-to-use API for loading pre-trained transformer models. AutoTokenizer can be used to tokenize input text, which is a necessary step before feeding text into a transformer model. AutoModelForSequenceClassification can be used to load a pre-trained transformer model for sequence classification tasks.

torch: This is the PyTorch library, which provides a wide variety of tools for building and training neural networks. It is often used in conjunction with the transformers library, as many of the models in transformers are implemented in PyTorch.

requests: This library allows you to send HTTP requests to a server and handle the server's response. It provides a simple and consistent interface for interacting with web services, which makes it useful for web scraping and other types of data collection tasks.

BeautifulSoup from bs4: This is a class from the Beautiful Soup library that is used to parse and navigate HTML and XML documents. It is often used in conjunction with the requests library to extract data from web pages.

re: This is Python's built-in module for working with regular expressions. It provides functions for searching, splitting, and manipulating strings. Regular expressions are a powerful tool for text processing and are often used for pattern matching, string manipulation, and data validation.

In summary, these imports bring in libraries and classes for tokenization, loading pre-trained models for sequence classification, tensor computation with PyTorch, making HTTP requests, parsing HTML and XML, and working with regular expressions.





# Instantiate Model

In [6]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

We are using Hugging Face's Transformers library, a library for natural language processing tasks that works with both PyTorch and TensorFlow.

The first line loads a tokenizer through the AutoTokenizer class. The AutoTokenizer.from_pretrained() method takes the name of a pre-trained model as a string and returns the tokenizer that matches it. In this case, the model is 'nlptown/bert-base-multilingual-uncased-sentiment', a multilingual BERT model fine-tuned for sentiment analysis. The tokenizer splits text into individual tokens (usually words or subwords) and maps them to the integer IDs the model expects.

The second line loads the model itself through the AutoModelForSequenceClassification class. Its from_pretrained() method also takes a model name as a string, and here it receives the same name, 'nlptown/bert-base-multilingual-uncased-sentiment'.

Because this BERT model is already fine-tuned for sentiment analysis and is multilingual, it can score text in several languages without being trained from scratch.

After loading the model, you can use it to predict the sentiment of new sentences. The model can also be fine-tuned further on a labeled dataset from a specific domain to improve performance there, with the training data preprocessed accordingly.





# Encode and calculate sentiment

In [9]:
tokens = tokenizer.encode('It was good but can be improved',return_tensors='pt')


This code uses the tokenizer to convert text into a sequence of token IDs (integers). The encode method takes a string ('It was good but can be improved') and returns its token IDs, including the special tokens that BERT expects at the start and end of the sequence. Setting return_tensors='pt' makes encode return a PyTorch tensor with a leading batch dimension, which is the form the model accepts.
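To make the idea of encoding concrete without downloading a model, here is a toy word-level encoder. It is a deliberately simplified stand-in, not BERT's actual subword tokenizer: the vocabulary, the IDs, and the toy_encode function are all hypothetical.

```python
import torch

# Hypothetical vocabulary; a real BERT tokenizer has many thousands of subword entries.
vocab = {'[CLS]': 0, '[SEP]': 1, '[UNK]': 2, 'it': 3, 'was': 4, 'good': 5, 'but': 6}

def toy_encode(text):
    """Lowercase, split on whitespace, map words to IDs, and wrap in special tokens."""
    words = text.lower().split()
    ids = [vocab['[CLS]']]
    ids += [vocab.get(word, vocab['[UNK]']) for word in words]
    ids.append(vocab['[SEP]'])
    return torch.tensor([ids])  # the leading dimension of size 1 is the batch dimension

toy_tokens = toy_encode('It was good but can be improved')
print(toy_tokens)        # words missing from the vocabulary map to the [UNK] id
print(toy_tokens.shape)  # (batch size, sequence length)
```

The real tokenizer differs mainly in that it breaks unknown words into known subword pieces rather than replacing them with a single unknown token.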





In [10]:
result = model(tokens)

This code passes the tensor of token IDs from the previous cell to the model loaded earlier. Calling model(tokens) invokes the model's __call__ method, which in turn runs its forward method.

The forward pass applies the operations defined by the model's architecture, using the weights learned during training. For a sequence classification model, the output contains one raw score (logit) per class.

The variable result holds that output. Its logits attribute, shown in the next cell, contains the scores for the input sentence.

In [11]:
result.logits

tensor([[-1.8678, -0.1295,  1.7736,  1.2541, -0.8785]],
       grad_fn=<AddmmBackward0>)

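result.logits holds the raw, unnormalized scores that the model produces before a final activation function is applied. Here the tensor has one row for the single input sentence and five columns, one per class. The grad_fn entry in the output indicates that PyTorch tracked these operations for gradient computation.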

In some machine learning models, particularly those used for classification tasks, the output of the model is often a set of logits, which are then passed through a final activation function, such as the softmax function, to produce a probability distribution over the classes. The class with the highest probability is then chosen as the final prediction of the model.
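As a small illustration, the logits printed above can be turned into probabilities with softmax. The values are copied by hand from that output, so this sketch runs without loading the model.

```python
import torch

# Logits copied from the output shown above (gradient tracking omitted).
logits = torch.tensor([[-1.8678, -0.1295, 1.7736, 1.2541, -0.8785]])

probs = torch.softmax(logits, dim=-1)  # normalize across the five classes
print(probs)
print(probs.sum().item())              # the probabilities sum to 1, up to rounding
print(int(torch.argmax(probs)))        # zero-based index of the most probable class
```

Softmax preserves the ordering of the scores, so the most probable class is the same one that has the largest logit.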

In [12]:
int(torch.argmax(result.logits))+1

3

The torch.argmax() function is used to find the index of the maximum value in a tensor along a specific dimension. In this case, torch.argmax(result.logits) will return the index of the highest logit value.

The result is then cast with int(), which converts the single-element tensor into a Python integer.

Finally, 1 is added. The five classes of this model correspond to ratings of 1 to 5 stars, while argmax returns a zero-based index from 0 to 4, so adding 1 turns the index into a star rating. Here the highest logit (1.7736) is at index 2, which gives a rating of 3.

# Collect Reviews

In [16]:
!pip install autoscraper



In [17]:
from autoscraper import AutoScraper

In [44]:
r = requests.get('https://www.yelp.com/biz/social-brew-cafe-pyrmont')
soup = BeautifulSoup(r.text, 'html.parser')
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class':regex})
reviews = [result.text for result in results]

This code uses the requests library to send an HTTP GET request to the specified URL ('https://www.yelp.com/biz/social-brew-cafe-pyrmont'), the page of a café on the Yelp website. The response is stored in the variable r.

It then uses BeautifulSoup to parse the HTML of the page (r.text) and creates the object soup, which lets the code search for specific elements with methods such as find_all().

Next, it compiles the regular expression '.*comment.*', which matches any string containing "comment". Passing it to find_all() returns every HTML <p> element whose class attribute contains "comment". The matches are stored in the variable results.

Finally, a list comprehension iterates over results, extracts the text content of each element, and stores it in the list reviews.

The reviews list therefore contains the review text for the Social Brew Cafe in Pyrmont, assuming the page marks up its reviews with such a class.
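The same parsing steps can be tried offline on a small HTML snippet. The markup and class names below are invented for illustration and are not taken from the Yelp page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are made up for this example.
html = """
<p class="comment__abc123 text">Great coffee.</p>
<p class="footer-note">Not a review.</p>
<span class="comment">Wrong tag.</span>
<p class="raw__comment">Friendly staff.</p>
"""

sample_soup = BeautifulSoup(html, 'html.parser')
regex = re.compile('.*comment.*')
sample_results = sample_soup.find_all('p', {'class': regex})  # <p> tags whose class matches
sample_reviews = [result.text for result in sample_results]
print(sample_reviews)
```

Only the two <p> elements with "comment" in a class name are kept. The <span> is skipped because of its tag, and the footer paragraph is skipped because of its class.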

In [45]:
reviews

['Great food amazing coffee and tea. Short walk from the harbor. Staff was very friendly',
 "It was ok. Had coffee with my friends. I'm new in the area, still need to discover new places.",
 'Great staff and food. \xa0Must try is the pan fried Gnocchi! \xa0The staff were really friendly and the coffee was good as well',
 "Ricotta hot cakes! These were so yummy. I ate them pretty fast and didn't share with anyone because they were that good ;). I ordered a green smoothie to balance it all out. Smoothie was a nice way to end my brekkie at this restaurant. Others with me ordered the salmon Benedict and the smoked salmon flatbread. They were all delicious and all plates were empty. Cheers!",
 'I came to Social brew cafe for brunch while exploring the city and on my way to the aquarium. I sat outside. The service was great and the food was good too!I ordered smoked salmon, truffle fries, black coffee and beer.',
 "It was ok. The coffee wasn't the best but it was fine. The relish on the brea

# Load Reviews into DataFrame and Score

In [37]:
import pandas as pd
import numpy as np

In [46]:
df = pd.DataFrame(np.array(reviews), columns=['review'])

This code uses the pandas library to create a DataFrame, a 2-dimensional labeled data structure with columns of potentially different types.

First, np.array(reviews) converts the reviews list into a NumPy array. This step is optional, since pd.DataFrame() also accepts a plain Python list.

The array is passed as the first argument to pd.DataFrame(), which creates a DataFrame from it.

The second argument, columns=['review'], names the single column of the DataFrame 'review'.

The resulting DataFrame df has a single column named 'review', with one row per element of the reviews list, that is, one row per review of the Social Brew Cafe in Pyrmont.
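A brief sketch with stand-in data suggests that the NumPy conversion is optional here: building the DataFrame from the list directly gives the same column contents.

```python
import numpy as np
import pandas as pd

sample_reviews = ['Great coffee.', 'It was ok.']  # stand-in data, not the scraped reviews

df_direct = pd.DataFrame(sample_reviews, columns=['review'])
df_via_numpy = pd.DataFrame(np.array(sample_reviews), columns=['review'])

print(df_direct)
print(df_direct['review'].tolist() == df_via_numpy['review'].tolist())
```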

In [48]:
df['review'].iloc[5]

"It was ok. The coffee wasn't the best but it was fine. The relish on the breakfast roll was yum which did make it sing. So perhaps I just got a bad coffee but the food was good on my visit."

In [49]:
def sentiment_score(review):
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits))+1

The function sentiment_score(review) takes a single argument, review, which is a string containing a customer review.

The function performs the following steps:

1. It tokenizes the review with tokenizer.encode, which converts the text into a sequence of token IDs. The return_tensors='pt' argument makes the method return a PyTorch tensor.

2. It passes the tensor of token IDs to the model by calling model(tokens), which runs the model's forward pass.

3. It extracts the logits from the output with result.logits.

4. It uses torch.argmax() to find the index of the largest logit, which is the class the model considers most likely for the review.

5. It converts that index to an integer and adds 1.

The value from step 5 is returned as the sentiment score, a rating from 1 to 5. It can be mapped to negative, neutral, or positive sentiment.
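The function as written tracks gradients during inference and places no limit on input length. One possible variant is sketched below, under the assumption that tokenizer and model are the Hugging Face objects loaded earlier. It receives them as parameters, truncates by tokens, and disables gradient tracking. The name sentiment_score_safe and the max_length parameter are introduced here for illustration.

```python
import torch

def sentiment_score_safe(review, tokenizer, model, max_length=512):
    """Return a 1-5 rating for one review, truncating long inputs by tokens."""
    # Calling the tokenizer returns a dict of tensors (input IDs, attention mask, ...).
    inputs = tokenizer(review, return_tensors='pt', truncation=True, max_length=max_length)
    with torch.no_grad():  # inference only, so skip gradient bookkeeping
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=-1)) + 1
```

It would be called as sentiment_score_safe(df['review'].iloc[1], tokenizer, model). Truncating by tokens keeps the input within BERT's 512-token limit regardless of how many characters the review contains.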

In [50]:
sentiment_score(df['review'].iloc[1])

3

In [51]:
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))

This code uses the pandas apply() method, which applies a given function to each element of a column. Here it is called on the 'review' column of the DataFrame df.

The function passed to apply() is a lambda, an anonymous function defined inline. It takes an input x and calls sentiment_score() on x[:512], the first 512 characters of each review. BERT models accept at most 512 tokens, and a token usually covers more than one character, so this character-based cut is a rough way to keep long reviews within the limit in most cases.

The values returned by the lambda are assigned to a new column, 'sentiment', with one value per row.

After this line runs, df contains a 'sentiment' column holding the score for each review. These scores can be used to classify the reviews as positive, neutral, or negative.
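The effect of apply() combined with the x[:512] slice can be seen on toy data. The toy_score function below is a hypothetical keyword check standing in for sentiment_score, so the sketch runs without the model.

```python
import pandas as pd

def toy_score(text):
    """Hypothetical stand-in for sentiment_score: 5 if 'great' appears, otherwise 3."""
    return 5 if 'great' in text.lower() else 3

toy_df = pd.DataFrame({'review': [
    'Great staff and food.',
    'It was ok.',
    'x' * 600 + ' great',  # the keyword sits beyond the first 512 characters
]})

# Score only the first 512 characters of each review.
toy_df['sentiment'] = toy_df['review'].apply(lambda x: toy_score(x[:512]))
print(toy_df['sentiment'].tolist())
```

The third review scores 3 rather than 5, because the slice discards everything after the first 512 characters before the scoring function sees the text.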

In [52]:
df

   review                                             sentiment
0  Great food amazing coffee and tea. Short walk ...  5
1  It was ok. Had coffee with my friends. I'm new...  3
2  Great staff and food. Must try is the pan fri...   5
3  Ricotta hot cakes! These were so yummy. I ate ...  5
4  I came to Social brew cafe for brunch while ex...  5
5  It was ok. The coffee wasn't the best but it w...  3
6  We came for brunch twice in our week-long visi...  4
7  Ron & Jo are on the go down under and Wow! We...   5
8  I went here a little while ago- a beautiful mo...  2
9  Great coffee and vibe. That's all you need. C...   5
