<a href="https://colab.research.google.com/github/rskrisel/webscraping/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping in Python

In this workshop, we will learn how to retrieve text data using web scraping methods.

</br>

We will start by retrieving data from a single URL, then we will iterate this process across a list of URLs.

</br>

We will then clean our text data and visualize our results using Word Clouds and Lexical Density.

### Acknowledgements

This workshop is adapted from the following tutorials:
   - Martin Breuss, _Real Python_, [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
   - Brannon Seay, _Code X_, [A Beginner’s Guide to Easily Create a Word Cloud in Python](https://medium.com/codex/a-beginners-guide-to-easily-create-a-word-cloud-in-python-7c3078c705b7)
   - Melanie Walsh, _Introduction to Cultural Analytics_, Web Scraping  parts [I](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/02-Web-Scraping-Part1.html) & [II](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/03-Web-Scraping-Part2.html)
    

## What Is Web Scraping?
Web scraping is the process of gathering information from the Internet. </br>
Manually copying and pasting information from a website is a form of web scraping! </br>
However, “web scraping” usually involves automation.

Some websites don't mind web scraping, while others have explicit terms of use against it (including most social media websites!). Always do your due diligence before scraping a website!
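
Part of that due diligence can be automated. Many sites publish a robots.txt file that states which paths automated clients may request. Here is a minimal sketch using Python's built-in urllib.robotparser. The robots.txt rules, the example.com URLs, and the helper name `is_allowed` are all made up for illustration, and a passing check does not replace reading a site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served from the site's /robots.txt path
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/news/story.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data.html"))
```

The first URL is allowed and the second is not, because only paths under /private/ are disallowed in this made-up file.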

### Why use Web Scraping for Text Analysis?

Web scraping is an essential data collection method in the text analysis toolbox.
It allows researchers to automate the collection of text data directly from websites that can then be used for analysis.

### Drawbacks of Web Scraping

Data collected through web scraping is considered unstructured. It will exist as a disorganized string of letters and numbers. It is up to the researcher to organize the collected data, generally in tabular format, but not always.

## Installing the Necessary Python Libraries

To successfully complete the workshop, you need the following libraries:
- **Requests** for making data requests from URLs (installation required)
- **BeautifulSoup** for cleaning up and decoding HTML text data (installation required)
- **Pandas** for visualizing and manipulating tabular data (comes standard with Anaconda)
- **NLTK** for text normalizing and cleaning (comes standard with Anaconda, but an additional feature needs to be installed)
- **Glob** to connect to directories on your OS (comes standard with Python 3, no download necessary)
- **Word Cloud** to create visual representations of word frequencies (installation required)
- **Matplotlib** to visualize the Word Clouds (comes standard with Anaconda and Colab; it is not part of Python itself)


### Setting up our workspace


1.   Mount your Drive
2.   Store your folder path in the `path` variable



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = 'your path' #copy/paste your path

### Installing the NLTK Wordnet Dictionary

The WordNet English dictionary is part of the Natural Language Tool Kit (NLTK) in Python.

Run the following code directly from your Notebook:

In [None]:
!pip install --upgrade nltk

In [None]:
import nltk
nltk.__version__

In [None]:
nltk.download('wordnet')

## Web Scraping from a single URL

### Responses and Requests

When you type a URL into your browser's address bar, you're sending an HTTP request for a web page. The server that stores that web page sends back a response: the web page data that your browser will render.

The process of connecting to a URL for web scraping is similar. We use the "requests" library to connect to the data stored at a URL.

To start, we need to bring the "requests" library into our Python environment:

In [None]:
import requests

Let's make our first request. This is a URL to an article on the CNN website:

In [None]:
response = requests.get("https://www.cnn.com/2024/09/27/us/eric-adams-nyc-mayor-arraignment/index.html")

Next, we can check to see whether or not the request was successful:

In [None]:
response

We get a 200 status code, which means our request was successful! Read here for more on status codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
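
To get a feel for what the different codes mean, here is a small sketch that uses Python's built-in http.HTTPStatus to label a status code. The helper name `describe_status` is an assumption made for this example, not part of the requests library.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK (success)' for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {phrase} ({outcome})"

for code in (200, 404, 503):
    print(describe_status(code))
```

With a real response object, the integer code is available as response.status_code, and response.raise_for_status() raises an exception for 4xx and 5xx codes.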

In order to get the text data from the response, we need to access its .text attribute, and we can save the results in a new variable, html_string. The results from the data request will be in [HTML format](https://www.udacity.com/blog/2021/04/html-for-dummies.html).

In [None]:
html_string = response.text
print(html_string)

Let's bring in our [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library to help us clean up and decode this HTML text data:

In [None]:
from bs4 import BeautifulSoup

Let's run our html_string variable through the Beautiful Soup object and use the get_text() function to extract the text from the HTML data. Then, let's use the print function to visualize our results:

In [None]:
soup = BeautifulSoup(html_string, "html.parser")  # name the parser explicitly to avoid a warning
article = soup.get_text()
print(article)
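
To build some intuition for what "extracting the text from the HTML" means, here is a rough sketch that strips tags using only Python's built-in html.parser module. This is not how BeautifulSoup is implemented. The class name `TextExtractor`, the sample HTML, and the decision to skip script and style contents are all choices made for this illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

# Made-up HTML for the demo
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>First <b>bold</b> paragraph.</p></body></html>"
)

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()
print(extractor.chunks)
```

The tags are gone and only the human-readable pieces remain. BeautifulSoup does this kind of work for us, and copes with messy real-world HTML far better than this sketch would.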

Let's save our results in a text file for future use:

In [None]:
with open(f"{path}/article.txt", "w", encoding="utf-8") as file:
    file.write(article)

You can check your "web_scraping_workshop" folder to make sure the "article.txt" file was successfully saved:

In [None]:
! ls $path

Success! Congrats on scraping your first news article!

## Web Scraping a sequence of URLs

We are now going to learn how to scrape text from a collection of URLs saved as a CSV file. We will use the database of articles we collected during our API workshop.

In order to use this dataset, we need to bring it into our Python environment. For this we will use the Pandas library.

In [None]:
import pandas as pd

In [None]:
data_df= pd.read_csv(f"{path}/news_articles.csv", delimiter=',', encoding='utf-8')

In [None]:
data_df

### Exploring & Cleaning our Dataframe

As always, let's examine our dataframe first before we use it to perform any kind of calculation or automation.

Let's explore our data types:

In [None]:
data_df.dtypes

It looks like our date is stored as an object (meaning string) instead of a datetime value. Let's convert it (note: the format must match the current format of the data in the column):

In [None]:
data_df['publishedAt'] = pd.to_datetime(data_df['publishedAt'], format='ISO8601')
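
For a sense of what this conversion does to a single value, here is a sketch using only Python's built-in datetime module. An ISO 8601 timestamp has the general shape of a date, the letter T, a time, and an optional UTC offset. The timestamp below is made up for illustration and is not taken from our dataset.

```python
from datetime import datetime

# Hypothetical timestamp in ISO 8601 form, for illustration only
raw = "2024-09-27T14:05:00+00:00"

parsed = datetime.fromisoformat(raw)
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed.year, parsed.month, parsed.day)
print(parsed.isoformat())
```

Once a value is a real datetime instead of a string, we can sort by it, filter by date range, or pull out parts like the year.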

Let's check our data types again to make sure the conversion worked:

In [None]:
data_df.dtypes

Success! Our data is stored in the proper datatypes!

Let's look at our dataframe again:

In [None]:
data_df

Notice the NaN values. Pandas has special ways of dealing with missing data. Blank cells in a CSV file show up as NaN in a Pandas DataFrame. </br>
- For a cleaner dataset, let's remove those rows with missing values.

In [None]:
data_df = data_df.dropna()
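
Keep in mind that dropna() with no arguments removes a row if any of its columns is missing a value. If only some columns matter for your analysis, the subset parameter limits the check to those columns. Here is a small sketch on a made-up DataFrame. The "url" column name mirrors the one used in this workshop, while the "author" column and all of the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset for illustration only
toy_df = pd.DataFrame({
    "url": ["https://example.com/a", None, "https://example.com/c"],
    "author": ["A. Writer", "B. Writer", None],
})

rows_all = len(toy_df.dropna())                 # drops rows with a missing value in any column
rows_url = len(toy_df.dropna(subset=["url"]))   # drops only rows whose url is missing
print(rows_all, rows_url)
```

Only one row survives the plain dropna(), while two rows survive when we only require a url.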

Let's take a look at our Dataframe to make sure we successfully removed our rows with missing values:

In [None]:
data_df

Success!

</br>

Finally, let's check for duplicates:

In [None]:
data_df[data_df.duplicated(keep=False)]

There are no duplicates in our dataset!

### Automating the Retrieval of Data from URLs

Each article in this CSV file is paired with a URL. How can we actually use these URLs to get computationally tractable text data?

Though we could manually navigate to each URL and copy/paste each article into a file, that would be painstakingly slow, and we would lose crucial data in the process, for example information that might help us automatically distinguish the article headline from the body of the article. It would be much better to programmatically access the text data attached to every URL.

Now that we have a sample dataset, let's set up our code for scraping the text from the list of URLs stored in the URL column. </br>

Let's create a new function called scrape_article() that includes our requests.get() and response.text code.

In [None]:
def scrape_article(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html_string = response.text
    return html_string

Let's apply our scrape_article function to the "url" column of the DataFrame and create a new column "text" for the HTML it returns:

In [None]:
data_df['text'] = data_df['url'].apply(scrape_article)
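
With a long list of URLs, a single failed request would stop the whole .apply() call, and firing requests back to back can strain a server. Here is one possible sketch of a more forgiving loop that pauses between requests and records None when a fetch fails. The names `scrape_many` and `fake_fetch` are assumptions made for this example, and `fake_fetch` stands in for scrape_article so the demo runs without network access.

```python
import time

def scrape_many(urls, fetch, delay=1.0):
    """Fetch each URL with `fetch`, pausing `delay` seconds between requests.

    Returns a dict mapping each URL to its HTML, or to None if the fetch failed.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite: wait before the next request
        try:
            results[url] = fetch(url)
        except Exception as error:  # deliberately broad for a workshop sketch
            print(f"Could not fetch {url}: {error}")
            results[url] = None
    return results

def fake_fetch(url):
    """Stand-in for scrape_article that simulates one failing URL."""
    if "broken" in url:
        raise ValueError("simulated network failure")
    return f"<html><body>{url}</body></html>"

demo = scrape_many(
    ["https://example.com/ok", "https://example.com/broken"],
    fake_fetch,
    delay=0,
)
print(demo)
```

In the notebook, you could call scrape_many(data_df['url'], scrape_article) and then build the "text" column with data_df['url'].map(...) on the returned dictionary. Passing a timeout (for example requests.get(url, timeout=10)) inside scrape_article also keeps one slow server from stalling the loop.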

Let's take a look at our new dataframe:

In [None]:
data_df

### Retrieving the text from each URL

In the next few steps, we are going to build our for loop that will automate the process of retrieving the text from each URL. We will do this in steps to check what each line in the for loop is doing.

</br>

Let's start by looking at the data stored in our "text" column:

In [None]:
for text in data_df['text']:
    print(text)

We can see that our data is in HTML format and is hard to read. Let's run our data through our BeautifulSoup object, apply the get_text() function, and visualize our results using the print function:

In [None]:
for text in data_df['text']:
    soup = BeautifulSoup(text, "html.parser")
    article = soup.get_text()
    print(article)

Now, let's keep building our for loop, and save our data as a text file:

In [None]:
with open(f"{path}/all_articles.txt", "w", encoding="utf-8") as file:
    for text in data_df['text']:
        soup = BeautifulSoup(text, "html.parser")
        article = soup.get_text()
        file.write(article)


Let's check the "all_articles.txt" text file saved in our web_scraping_workshop folder. We have all the articles saved in one doc. This may be a useful way to save our data for some forms of analysis. That being said, if we plan on running any type of comparative analysis, we will need to have them saved as separate files.

</br>

Let's start by creating a new folder, "files", inside our web_scraping_workshop folder:

In [None]:
! mkdir $path/files

Next, let's build on our for loop and create a file naming schema to save each article as an individual text file. To do that, we are going to create a counter (id) that starts at 0 and goes up by 1 for each article, and use an f-string (f"...") to insert that number into each file name:


In [None]:
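id = 0
for text in data_df['text']:
    soup = BeautifulSoup(text, "html.parser")
    article = soup.get_text()

    id += 1
    # write as UTF-8 so the files can be read back with encoding='utf-8' later
    with open(f"{path}/files/article_{id}.txt", "w", encoding="utf-8") as file:
        file.write(article)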
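
One thing to watch with numbered file names: sorted alphabetically, "article_10.txt" comes before "article_2.txt". A possible variation, sketched below, uses enumerate() to do the counting and zero-pads the number so alphabetical order matches numeric order. The function name `save_articles`, the three-digit padding, and the demo text are assumptions for illustration.

```python
import os
import tempfile

def save_articles(articles, folder):
    """Write each article to article_001.txt, article_002.txt, ... and return the paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for number, article in enumerate(articles, start=1):
        filepath = os.path.join(folder, f"article_{number:03d}.txt")
        with open(filepath, "w", encoding="utf-8") as file:
            file.write(article)
        saved.append(filepath)
    return saved

# Demo with made-up text, written to a temporary folder that is deleted afterwards
with tempfile.TemporaryDirectory() as tmp:
    paths = save_articles(["first article", "second article"], tmp)
    saved_names = [os.path.basename(p) for p in paths]

print(saved_names)
```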

Let's check our files folder to see if our results were successfully saved...

In [None]:
! ls $path/files

Congrats, you just ran your first automation loop to web scrape a list of articles!

## Text Cleaning and Analysis

Now that we have our data saved in individual text files, we can run through the process of normalizing and cleaning our data. This includes making the text lowercase, stripping punctuation, and lemmatizing.

</br>

Once we are done normalizing and cleaning our data, we can then visualize our results in Word Clouds and run a lexical density analysis.

Let's start by importing all of our libraries:


In [None]:
import nltk
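nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Newer NLTK releases look for these resource names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')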

from nltk.corpus import stopwords
stops = stopwords.words('english')

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

from wordcloud import WordCloud
import matplotlib.pyplot as plt

import glob

Next, let's create two new folders within the "web_scraping_workshop" folder: one called "files_cleaned" where we will save our normalized and cleaned files and another called "wordclouds" where we will save our word cloud outputs.

In [None]:
! mkdir $path/files_cleaned

In [None]:
! mkdir $path/wordclouds

Let's use the Glob library to connect to our "files" directory and store the result in the variable "files". This will turn our file directory into a list of filepaths.

In [None]:
directory = f"{path}/files"
files = glob.glob(f"{directory}/*.txt")

Let's take a look at our "files" variable:

In [None]:
files

We can see the contents of our "files" directory as a list.
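
Note that glob does not promise any particular order for the paths it returns. If the order matters (for example, to process articles in the same sequence every time), wrapping the call in sorted() is a simple option. Here is a sketch using a temporary folder and made-up file names:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Create three empty, made-up files in a scrambled order
    for name in ("article_3.txt", "article_1.txt", "article_2.txt"):
        open(os.path.join(demo_dir, name), "w").close()

    demo_files = sorted(glob.glob(os.path.join(demo_dir, "*.txt")))
    demo_names = [os.path.basename(f) for f in demo_files]

print(demo_names)
```

Keep in mind that sorted() orders strings alphabetically, so with ten or more files "article_10.txt" would land before "article_2.txt" unless the numbers are zero-padded.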

Let's create a function to derive the Part of Speech (POS) of given words. We will use this function to lemmatize our text based on the part of speech (POS) tag.

In [None]:
# Fcn source: https://medium.com/codex/a-beginners-guide-to-easily-create-a-word-cloud-in-python-7c3078c705b7
# and https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

Let's build a for loop to clean the files saved in our "files" variable.
We have a few steps:
- Create a counter (id) that starts at 0 and use an f-string (f"...") to number each output file as we move through our list of articles
- We begin our for loop by telling Python to go through each item in the "files" list
- For each file path, we want Python to do the following:
    - open the text file attached to the filepath and set it equal to the variable "text"
    - transform the words in variable "text" into tokens and set it equal to the variable "text_tokens"
    - process "text_tokens" for use with NLTK and set it equal to the variable "nltk_text"
    - the next three steps make the tokens lowercase and remove punctuation (text_lower), remove stop words (text_stops), and lemmatize the tokens (text_clean)
    - save the cleaned tokens, build and save a word cloud, and compute lexical density (unique tokens divided by total tokens, over the first 600 cleaned tokens)
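
To see the first two cleaning steps on a single sentence before running them on whole files, here is a simplified sketch that needs no NLTK downloads. It swaps in a crude regular-expression tokenizer and a tiny made-up stop word list (`DEMO_STOPS`) in place of NLTK's tokenizer and stop word list, and it leaves out lemmatization, which needs the WordNet data. The function name `simple_clean` and the sample sentence are assumptions for illustration.

```python
import re

# Tiny hypothetical stop word list; the workshop uses NLTK's much longer English list
DEMO_STOPS = {"the", "is", "a", "on", "and"}

def simple_clean(text, stop_words=DEMO_STOPS):
    """Lowercase, keep alphanumeric tokens only, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # crude tokenizer: runs of letters/digits
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]

cleaned = simple_clean("The mayor is speaking, and the crowd is LISTENING on a Friday!")
print(cleaned)
```

The punctuation and stop words are gone, and everything that remains is lowercase.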


In [None]:
id = 0
lexical_density = []
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token
for filepath in files:
    with open(filepath, encoding='utf-8') as infile:
        text = infile.read()
    text_tokens = nltk.word_tokenize(text)
    nltk_text = nltk.Text(text_tokens)
    text_lower = [t.lower() for t in nltk_text if t.isalnum()]
    text_stops = [t for t in text_lower if t not in stops]
    text_clean = [lemmatizer.lemmatize(t, get_wordnet_pos(t)) for t in text_stops]

    # save cleaned files
    # (increment id only once per article so both outputs share the same number)
    id += 1
    with open(f"{path}/files_cleaned/article_cleaned_{id}.txt", "w", encoding='utf-8') as file:
        file.write(str(text_clean))

    # create Word Clouds
    unique_string = " ".join(text_clean)
    wordcloud = WordCloud(max_font_size=40).generate(unique_string)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

    # save Word Clouds
    wordcloud.to_file(f"{path}/wordclouds/word_cloud_{id}.png")

    # establish lexical density: unique tokens / total tokens in the first 600 tokens
    text_clean_slice = text_clean[0:600]
    ld_results = len(set(text_clean_slice)) / len(text_clean_slice)
    print(ld_results)
    ld_dict = {'File_name': filepath, 'lexical_density': ld_results}
    lexical_density.append(ld_dict)

print(lexical_density)

We can visualize the results of our lexical density analysis in a dataframe:

In [None]:
ld_df = pd.DataFrame(lexical_density)
ld_df = ld_df.sort_values(by='File_name', ascending=True)
ld_df
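
As computed in this workshop, lexical density is the number of unique tokens divided by the total number of tokens, taken over the first 600 cleaned tokens of each article (a ratio often called the type-token ratio). Here is a small worked sketch on a made-up token list. The function name `lexical_density_of` and the guard for empty input are additions for illustration.

```python
def lexical_density_of(tokens, window=600):
    """Share of unique tokens among the first `window` tokens (0.0 for an empty list)."""
    sample = tokens[:window]
    if not sample:
        return 0.0  # avoid dividing by zero when an article has no usable tokens
    return len(set(sample)) / len(sample)

# Made-up tokens: 5 unique words out of 8 in total
demo_tokens = ["mayor", "court", "mayor", "city", "court", "mayor", "judge", "trial"]
print(lexical_density_of(demo_tokens))
```

Five unique words out of eight gives 0.625. A value closer to 1 means less repetition in the sample, and using the same window size for every article keeps long and short articles comparable.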


Congrats on making it to the end of this workshop!
</br>
We've only scratched the surface in terms of the web scraping capabilities of Beautiful Soup. You could run a script to collect all the links of a web page. I've even used it to automatically download an archive of PDF files hosted on a URL.
</br>
Happy coding!