## **Full disclosure -- I used ChatGPT for much of this code (accessed 1/5 through 1/7).**

# Week 1 - Measuring Meaning & Sampling
This week, we begin by "begging, borrowing and stealing" text from several
contexts of human communication (e.g., PDFs, HTML, Word) and preparing it for
machines to "read" and analyze so that we can begin to build our sample. This notebook outlines scraping text from the web, PDF and Word documents. Then we detail "spidering" or walking
through hyperlinks to build samples of online content, and using APIs,
Application Programming Interfaces, provided by webservices to access their
content. Along the way, we will use regular expressions, outlined in the
reading, to remove unwanted formatting and ornamentation. Next, we discuss
various text encodings, filtering and data structures in which text can be
placed for analysis. Finally, we ask you to begin building a corpus for preliminary analysis and articulate what your sample represents in the context of your final project.

We made a Python package just for this course: `lucem_illud`. If you haven't installed it yet, run the following code first. You only need to install it once; afterwards, simply import it with `import lucem_illud`. For your final projects, you may find it useful to [read the lucem_illud source code](https://github.com/UChicago-Computational-Content-Analysis/lucem_illud/tree/main/lucem_illud) and adapt it to your own interests.

In [None]:
!pip install git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git
#installing lucem_illud package
#lucem_illud is a Latin phrase meaning "that light", the insight we can discover in text data!
#If you get an error like "Access is denied", try running the `pip` command on the command line as an administrator.

If you're not familiar with Jupyter notebooks, you may wonder what the exclamation mark (`!`) at the beginning of the command does (or even what `pip` means). `pip` is Python's package installer, and the exclamation mark lets us execute terminal commands from notebook cells (e.g., run `!ls` to display files in the current folder).

`lucem_illud` also depends on an English language model for `spacy`, which has to be downloaded separately. You will see this 'en' model later, but you should run the following 2 lines of code now so that you can import `lucem_illud`.

In [None]:
import spacy
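!python -m spacy download en
#On spaCy 3 and later the `en` shortcut is no longer accepted; if this fails, try `en_core_web_sm` instead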

For this notebook we will be using the following packages:

In [3]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
import lucem_illud #pip install git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git

#All these packages need to be installed from pip
import requests #for http requests
import bs4 #called `beautifulsoup4`, an html parser
import pandas as pd #gives us DataFrames
import docx #reading MS doc files, install as `python-docx`

#Stuff for pdfs
#Install as `pdfminer2`
import pdfminer.pdfinterp
import pdfminer.converter
import pdfminer.layout
import pdfminer.pdfpage

#These come with Python
import re #for regexs
import urllib.parse #For joining urls
import io #for making http requests look like files
import json #For Tumblr API responses
import os.path #For checking if files exist
import os #For making directories

We will also be working on the following files/urls

In [4]:
wikipedia_base_url = 'https://en.wikipedia.org'
wikipedia_content_analysis = 'https://en.wikipedia.org/wiki/Content_analysis'
content_analysis_save = 'wikipedia_content_analysis.html'
example_text_file = 'sometextfile.txt'
information_extraction_pdf = 'https://github.com/Computational-Content-Analysis-2018/Data-Files/raw/master/1-intro/Content%20Analysis%2018.pdf'
example_docx = 'https://github.com/Computational-Content-Analysis-2018/Data-Files/raw/master/1-intro/macs6000_connecting_to_midway.docx'
example_docx_save = 'example.docx'

# Scraping

Before we can start analyzing content we need to obtain it. Sometimes it will be
provided to us from a pre-curated text archive, but sometimes we will need to
download it. As a starting example we will attempt to download the Wikipedia
page on content analysis. The page is located at
[https://en.wikipedia.org/wiki/Content_analysis](https://en.wikipedia.org/wiki/Content_analysis),
so let's start with that.

We can do this by making an HTTP GET request to that url. A GET request is
simply a request to the server to provide the contents given by some url. The
other request we will be using in this class is called a POST request, which
asks the server to take some content we provide. While the Python standard
library does have the ability to make GET requests, we will be using the
[_requests_](http://docs.python-requests.org/en/master/) package as it is _'the
only Non-GMO HTTP library for Python'_... also it provides a nicer interface.
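To make the GET/POST distinction concrete, here is a minimal sketch that builds both kinds of request without sending anything over the network. It uses the standard library's `urllib.request` rather than `requests`, purely so it can run offline, and the form field `comment` is made up for illustration.

```python
import urllib.parse
import urllib.request

url = 'https://en.wikipedia.org/wiki/Content_analysis'

# A request with no body is a GET: "please send me what lives at this url"
get_request = urllib.request.Request(url)

# A request with a body is a POST: "please take this content I am sending"
# The form field below is hypothetical, not something Wikipedia expects
form_data = urllib.parse.urlencode({'comment': 'hello'}).encode('utf-8')
post_request = urllib.request.Request(url, data=form_data)

# Nothing has been sent; we are only inspecting the request objects
print(get_request.get_method())
print(post_request.get_method())
print(post_request.data)
```

With `requests`, the corresponding calls are `requests.get(url)` and `requests.post(url, data=...)`, which build and send the request in one step.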

In [5]:
#wikipedia_content_analysis = 'https://en.wikipedia.org/wiki/Content_analysis'
requests.get(wikipedia_content_analysis)

<Response [200]>

`'Response [200]'` means the server responded with what we asked for. If you get
another number (e.g. 404) it likely means there was some kind of error. These
numbers are called HTTP response codes, and a list of them can be found
[here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). The response
object contains all the data the server sent, including the website's contents
and the HTTP header. We are interested in the contents, which we can access with
the `.text` attribute.
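Before reading the contents, it can help to see how status codes map to meanings. The sketch below uses the standard library's `http.HTTPStatus` table; `describe_status` is a hypothetical helper written for this illustration, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)
    # Codes from 200 to 299 signal success; anything else needs a closer look
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return f'{code} {status.phrase}: {outcome}'

for code in (200, 404, 500):
    print(describe_status(code))
```

On a real `requests` response the number is available as `.status_code`, and `.raise_for_status()` raises an exception for error codes so that a failed download does not go unnoticed.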

In [6]:
wikiContentRequest = requests.get(wikipedia_content_analysis)
print(wikiContentRequest.text[:1000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Content analysis - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-cli

This is not what we were looking for: it is the start of the raw HTML that
makes up the page, which is meant to be read by computers. Luckily
we have a computer to parse it for us. To do the parsing we will use [_Beautiful
Soup_](https://www.crummy.com/software/BeautifulSoup/), which offers a friendlier
interface than the parser in the standard library.

But before we proceed to Beautiful Soup, a digression about Python syntax, especially about objects and functions.
For those who are not familiar with Python syntax (or who are more familiar with R), you might wonder what `requests.get` or `wikiContentRequest.text` mean. To understand this, you first need to understand what objects are. You may have heard that Python is an object-oriented programming language (R code, by contrast, is more often written in a procedural or functional style). An object is a bundle of variables (data) together with functions that operate on that data. So, in object-oriented languages like Python, variables and functions are bundled into objects.

For example, let's look at `wikiContentRequest`. We use the `dir()` function, which returns a list of the attributes and functions of an object.

In [7]:
 dir(wikiContentRequest)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

There's 'text' here. We used `wikiContentRequest.text` to access it. In other words, we use `.` (dot notation) to access what an object contains. `wikiContentRequest` has a set of attributes and functions, as shown above, and we used `wikiContentRequest.text` to access one of them. Dot notation does not necessarily refer to functions: it refers to anything the object contains. Here `text` is accessed like data (the page's contents as a string), whereas `raise_for_status` is a function that has to be called with parentheses.
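The difference between attributes and functions on an object can be seen with a toy class. `Page` and its `word_count` method are invented for this sketch and have nothing to do with the `requests` library.

```python
class Page:
    """A toy object bundling data (attributes) with functions (methods)."""

    def __init__(self, text):
        self.text = text  # attribute: data stored on the object

    def word_count(self):  # method: a function attached to the object
        return len(self.text.split())

page = Page('Content analysis is the study of documents')

print(page.text)          # attribute access, no parentheses
print(page.word_count())  # method call, parentheses required

# dir() lists both kinds of names; callable() tells them apart
public_names = [name for name in dir(page) if not name.startswith('_')]
for name in public_names:
    kind = 'method' if callable(getattr(page, name)) else 'attribute'
    print(name, '->', kind)
```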



Moving on to the next step: BeautifulSoup, a Python library which extracts data from HTML and XML, and transforms HTML files into Python objects.

In [8]:
wikiContentSoup = bs4.BeautifulSoup(wikiContentRequest.text, 'html.parser')
print(wikiContentSoup.text[:200])





Content analysis - Wikipedia




































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom arti


This is better but there's still random whitespace and we have more than just
the text of the article. This is because what we requested is the whole webpage,
not just the text for the article.

We want to extract only the text we care about, and in order to do this we will
need to inspect the html. One way to do this is simply to go to the website with
a browser and use its inspection or view source tool. If javascript or other
dynamic loading occurs on the page, however, it is likely that what Python
receives is not what you will see, so we will need to inspect what Python
receives. To do this we can save the html `requests` obtained.

In [9]:
#content_analysis_save = 'wikipedia_content_analysis.html'

with open(content_analysis_save, mode='w', encoding='utf-8') as f:
    f.write(wikiContentRequest.text)

`open()` is a function that opens a file and returns a file object. It has multiple modes; here we used mode `'w'`, which opens a file for writing, creating it if it does not exist and emptying it if it does. We then use the `write` function to write to the empty file (`content_analysis_save`) that we created with `open(content_analysis_save, mode='w', encoding='utf-8')`. What did we write to this file? The text we got from `wikiContentRequest.text`.
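As a small, self-contained illustration of the write and read modes, the sketch below writes a string to a file and reads it back. It works inside a temporary directory so that no real file is overwritten; the file name `sample.html` and the HTML snippet are placeholders.

```python
import os
import tempfile

sample_html = '<html><body><p>Café</p></body></html>'

# Work in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.html')

    # mode='w' creates the file (or empties an existing one) for writing
    with open(sample_path, mode='w', encoding='utf-8') as f:
        f.write(sample_html)

    # mode='r' opens the same file for reading
    with open(sample_path, mode='r', encoding='utf-8') as f:
        read_back = f.read()

print(read_back == sample_html)
```

Passing `encoding='utf-8'` on both sides keeps non-ASCII characters such as the "é" intact regardless of the operating system's default encoding.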

Now let's open the file (`wikipedia_content_analysis.html`) we just created with
a web browser. It should look sort of like the original but without the images
and formatting.

As there is very little standardization on structuring webpages, figuring out
how best to extract what you want is an art. Looking at this page it looks like
all the main textual content is inside `<p>`(paragraph) tags within the `<body>`
tag.

In [10]:
contentPTags = wikiContentSoup.body.findAll('p')
for pTag in contentPTags[:3]:
    print(pTag.text)

Content analysis is the study of documents and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner.[1] One of the key advantages of using content analysis to analyse social phenomena is their non-invasive nature, in contrast to simulating social experiences or collecting survey answers.

Practices and philosophies of content analysis vary between academic disciplines. They all involve systematic reading or observation of texts or artifacts which are assigned labels (sometimes called codes) to indicate the presence of interesting, meaningful pieces of content.[2][3] By systematically labeling the content of a set of texts, researchers can analyse patterns of content quantitatively using statistical methods, or use qualitative methods to analyse meanings of content within texts.

Computers are increasingly used in content analysis to au

Another excursion for those who are not familiar with programming: the for loop. A for loop iterates over a sequence. `contentPTags` contains multiple paragraphs, each of which starts with `<p>` and ends with `</p>`. What `for pTag in contentPTags[:3]` does here is take each paragraph in `contentPTags`, which we limited to the first three using `contentPTags[:3]`, and print it. So, we have three paragraphs. By the way, you can also use `<p>` tags in the Markdown cells of a Jupyter notebook!
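The same loop-and-slice pattern, reduced to plain strings so that each step is visible. The list contents here are placeholders rather than real page text.

```python
paragraphs = ['first paragraph', 'second paragraph', 'third paragraph', 'fourth paragraph']

# [:3] is a slice: it keeps the items at positions 0, 1 and 2
first_three = paragraphs[:3]

shouted = []
for paragraph in first_three:
    # The indented body runs once per item in the sequence
    shouted.append(paragraph.upper())

print(shouted)
```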

We now have all the text from the page, split up by paragraph. If we wanted to
get the section headers or references as well it would require a bit more work,
but is doable.

There is one more thing we might want to do before sending this text to be
processed: remove the reference indicators (`[2]`, `[3]`, etc.). To do this we
can use a short regular expression (regex).

In [11]:
contentParagraphs = []
for pTag in contentPTags:
    #strings starting with r are raw so their \'s are not modifier characters
    #If we didn't start with r the string would be: '\\[\\d+\\]'
    contentParagraphs.append(re.sub(r'\[\d+\]', '', pTag.text))

#convert to a DataFrame
contentParagraphsDF = pd.DataFrame({'paragraph-text' : contentParagraphs})
print(contentParagraphsDF)

                                       paragraph-text
0   Content analysis is the study of documents and...
1   Practices and philosophies of content analysis...
2   Computers are increasingly used in content ana...
3   Content analysis is best understood as a broad...
4   The simplest and most objective form of conten...
5   A further step in analysis is the distinction ...
6   Quantitative content analysis highlights frequ...
7   Siegfried Kracauer provides a critique of quan...
8   The data collection instrument used in content...
9   According to current standards of good scienti...
10  Furthermore, the Database of Variables for Con...
11  With the rise of common computing facilities l...
12  By having contents of communication available ...
13  Computer-assisted analysis can help with large...
14  Robert Weber notes: "To make valid inferences ...
15  According to today's reporting standards, quan...
16  There are five types of texts in content analy...
17  Content analysis is rese

Since we have learned how a for loop works, you can probably follow what we did here: with `contentParagraphs = []` we made an empty list; then, for each paragraph in `contentPTags`, we substituted every match of `\[\d+\]` with `''` (i.e., removed it) and appended the cleaned paragraph to the list. As we can see, we now have a DataFrame in which each row is one paragraph of `contentPTags`, without reference indicators.

By the way, what does `\[\d+\]` mean? If you are not familiar with regex, it is a way of specifying searches in text.
A regex engine takes in the search pattern, in the above case `'\[\d+\]'`, and
some string, here the paragraph texts. Then it reads the input string one character
at a time, checking if it matches the search. Here the regex `'\d'` matches
digit characters (while `'\['` and `'\]'` match the literal square brackets on either side).
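To see the pattern at work outside the scraping loop, here is a short sketch on two made-up sentences. The sample strings are invented; the pattern is the same one used above.

```python
import re

sample = 'Social scientists use content analysis.[1] Labels are called codes.[2][13]'

# \[ and \] match literal square brackets; \d+ matches one or more digits
cleaned = re.sub(r'\[\d+\]', '', sample)
print(cleaned)

# Brackets whose contents are not purely digits are left alone
untouched = re.sub(r'\[\d+\]', '', 'see [note] and [a1]')
print(untouched)
```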

Now we have a `DataFrame` containing all relevant text from the page, ready to be processed. Before moving on, let's look at how regex matching works on a small example.

In [12]:
findNumber = r'\d'
regexResults = re.search(findNumber, 'not a number, not a number, numbers 2134567890, not a number')
regexResults

<re.Match object; span=(36, 37), match='2'>

In Python the regex package (`re`) usually returns `Match` objects (a single
`Match` can hold several captured groups). To get the string that matched our
pattern we can use the `.group()` method; group 0 is always the entire match,
so that is what we ask for.

In [13]:
print(regexResults.group(0))

2


That gives us the first digit. If we want the whole block of digits we can
add the quantifier `'+'`, which requests one or more instances of the preceding
character.

In [14]:
findNumbers = r'\d+'
regexResults = re.search(findNumbers, 'not a number, not a number, numbers 2134567890, not a number')
print(regexResults.group(0))

2134567890


Now we have the whole block of numbers. There are a huge number of special
characters in regex; for the full description of Python's implementation look at
the [re docs](https://docs.python.org/3/library/re.html). There is also a short
[tutorial](https://docs.python.org/3/howto/regex.html#regex-howto).
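Two further `re` features come up often when cleaning scraped text: finding every match rather than only the first, and pulling out parts of a match with capture groups. The strings below are made up for illustration.

```python
import re

text = 'not a number, 42, not a number, 2134567890, then 7'

# re.search stops at the first hit
first = re.search(r'\d+', text).group(0)

# re.findall returns every non-overlapping hit as a list of strings
every = re.findall(r'\d+', text)

print(first)
print(every)

# Parentheses create capture groups, numbered from 1 in order of appearance
monthYear = re.search(r'(\w+) (\d{4})', 'Player counts for December 2023 were stable')

print(monthYear.group(0))  # the entire matched text
print(monthYear.group(1))  # first capture group: the month name
print(monthYear.group(2))  # second capture group: the year
```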

# <font color="red">Exercise 1</font>
<font color="red">Construct cells immediately below this that describe and download web content relating to your anticipated final project. Use Beautiful Soup and at least five regular expressions to extract relevant, nontrivial *chunks* of that content (e.g., cleaned sentences, paragraphs, etc.) to a pandas `DataFrame`.</font>

**Here, I scrape the SteamCharts page to get the average player counts for each month, for various games. I limited the scraping to a subset of games since getting all of them would be computationally intensive. For a full-scale analysis, I could use spidering to get the game_ids below.**

In [15]:
#games to have info scraped from
game_ids = ["218620", "231430", "242700", "200510", "204300", "219740", "44350", "1250", "35720"]

In [16]:
#TEST CASE, to make sure the code works
steamcharts_url = "https://steamcharts.com/app/1250"

# Send a GET request to the URL
steamplrs = requests.get(steamcharts_url)

# Parse the HTML content using BeautifulSoup
steamSoup = bs4.BeautifulSoup(steamplrs.content, 'html.parser')

# Find all <tr> tags
tr_tags = steamSoup.find_all('tr')

# Print or process the <tr> tags as needed
for tr_tag in tr_tags:
    print(tr_tag)

<tr>
<th class="left">Month</th>
<th class="right">Avg. Players</th>
<th class="right">Gain</th>
<th class="right">% Gain</th>
<th class="right">Peak Players</th>
</tr>
<tr class="odd">
<td class="month-cell left italic">Last 30 Days</td>
<td class="right num-f italic">236.74</td>
<td class="right num-p gainorloss italic">+11.7</td>
<td class="right gainorloss italic">+5.20%</td>
<td class="right num italic">425</td>
</tr>
<tr>
<td class="month-cell left">
					December 2023
				</td>
<td class="right num-f">225.03</td>
<td class="right num-p gainorloss">-10.77</td>
<td class="right gainorloss">-4.57%</td>
<td class="right num">425</td>
</tr>
<tr class="odd">
<td class="month-cell left">
					November 2023
				</td>
<td class="right num-f">235.80</td>
<td class="right num-p gainorloss">-5.14</td>
<td class="right gainorloss">-2.14%</td>
<td class="right num">503</td>
</tr>
<tr>
<td class="month-cell left">
					October 2023
				</td>
<td class="right num-f">240.95</td>
<td class="right

In [17]:
# Initialize lists to store the data
months_data = []
numbers_data = []

# Extract data from <tr> tags
for tr_tag in tr_tags:
    # Find all <td class="month-cell left"> and <td class="right num-f"> within the current <tr> tag
    month_cells = tr_tag.find_all('td', class_='month-cell left')
    num_cells = tr_tag.find_all('td', class_='right num-f')

    # Extract text content from each <td> and append to the respective lists
    months_data.extend(cell.get_text(strip=True) for cell in month_cells)
    numbers_data.extend(cell.get_text(strip=True) for cell in num_cells)

# Print or process the extracted data
print("Months data:", months_data)
print("Numbers data:", numbers_data)

Months data: ['December 2023', 'November 2023', 'October 2023', 'September 2023', 'August 2023', 'July 2023', 'June 2023', 'May 2023', 'April 2023', 'March 2023', 'February 2023', 'January 2023', 'December 2022', 'November 2022', 'October 2022', 'September 2022', 'August 2022', 'July 2022', 'June 2022', 'May 2022', 'April 2022', 'March 2022', 'February 2022', 'January 2022', 'December 2021', 'November 2021', 'October 2021', 'September 2021', 'August 2021', 'July 2021', 'June 2021', 'May 2021', 'April 2021', 'March 2021', 'February 2021', 'January 2021', 'December 2020', 'November 2020', 'October 2020', 'September 2020', 'August 2020', 'July 2020', 'June 2020', 'May 2020', 'April 2020', 'March 2020', 'February 2020', 'January 2020', 'December 2019', 'November 2019', 'October 2019', 'September 2019', 'August 2019', 'July 2019', 'June 2019', 'May 2019', 'April 2019', 'March 2019', 'February 2019', 'January 2019', 'December 2018', 'November 2018', 'October 2018', 'September 2018', 'August 

In [18]:
#FULL TEXT RETRIEVAL, getting all of the games I want for the dataframe
# List of game_ids
game_ids = ["218620", "231430", "242700", "200510", "204300", "219740", "44350", "1250", "35720"]

# Initialize an empty DataFrame named "game_data"
game_data = pd.DataFrame(columns=["game_id", "months_data", "numbers_data"])

# Iterate through each game_id
for game_id in game_ids:
    # URL for the current game_id
    url = f"https://steamcharts.com/app/{game_id}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = bs4.BeautifulSoup(response.content, 'html.parser')

        # Find all <tr> tags
        tr_tags = soup.find_all('tr')

        # Initialize lists to store the data
        months_data = []
        numbers_data = []

        # Extract data from <tr> tags
        for tr_tag in tr_tags:
            # Find all <td class="month-cell left"> and <td class="right num-f"> within the current <tr> tag
            month_cells = tr_tag.find_all('td', class_='month-cell left')
            num_cells = tr_tag.find_all('td', class_='right num-f')

            # Extract text content from each <td> and append to the respective lists
            months_data.extend(cell.get_text(strip=True) for cell in month_cells)
            numbers_data.extend(cell.get_text(strip=True) for cell in num_cells)

        # Create a dictionary with the data
        data_dict = {"game_id": game_id, "months_data": months_data, "numbers_data": numbers_data}

        # Append the data to the DataFrame (DataFrame.append was removed in pandas 2.0)
        game_data = pd.concat([game_data, pd.DataFrame([data_dict])], ignore_index=True)

    else:
        print(f"Failed to retrieve the webpage for game_id {game_id}. Status Code: {response.status_code}")

#view game dataframe
game_data.head(5)

Unnamed: 0,game_id,months_data,numbers_data
0,218620,"[December 2023, November 2023, October 2023, S...","[26041.68, 26465.88, 25632.99, 28358.07, 29461..."
1,231430,"[December 2023, November 2023, October 2023, S...","[3821.00, 3803.77, 3930.76, 3844.39, 3773.83, ..."
2,242700,"[December 2023, November 2023, October 2023, S...","[37.72, 40.85, 49.12, 39.77, 46.47, 58.35, 50...."
3,200510,"[December 2023, November 2023, October 2023, S...","[607.21, 547.19, 512.43, 578.92, 611.38, 727.7..."
4,204300,"[December 2023, November 2023, October 2023, S...","[23.38, 22.62, 22.80, 63.43, 108.09, 109.38, 1..."


In [19]:
# Separate "months_data" so individual dates in each row
date_counts_rows = []
for _, row in game_data.iterrows():
    for date_value, number_value in zip(row['months_data'], row['numbers_data']):
        date_counts_row = {
            "game_id": row["game_id"],
            "date_count_value": date_value,
            "numbers_data": number_value
        }
        date_counts_rows.append(date_counts_row)

# Create a new DataFrame with the separated values and call it "date_counts"
date_counts = pd.DataFrame(date_counts_rows)

#view cleaned dataframe
date_counts.head(5)

Unnamed: 0,game_id,date_count_value,numbers_data
0,218620,December 2023,26041.68
1,218620,November 2023,26465.88
2,218620,October 2023,25632.99
3,218620,September 2023,28358.07
4,218620,August 2023,29461.89


In [20]:
#Player Counts by Month (just further cleaning of the labels)
player_counts = date_counts.rename(columns={"date_count_value": "date", "numbers_data": "player_count"})
player_counts.head(5)

Unnamed: 0,game_id,date,player_count
0,218620,December 2023,26041.68
1,218620,November 2023,26465.88
2,218620,October 2023,25632.99
3,218620,September 2023,28358.07
4,218620,August 2023,29461.89
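One caveat about the table above: the values come from `get_text()`, so both the dates and the player counts are still strings. Here is a minimal sketch of converting them to real types using only the standard library. The two rows are copied from the output above, and the names `scraped_rows` and `typed_rows` are made up for this example.

```python
from datetime import datetime

# Rows shaped like the scraped table: every value is still a string
scraped_rows = [
    ('218620', 'December 2023', '26041.68'),
    ('218620', 'November 2023', '26465.88'),
]

typed_rows = []
for game_id, month_label, count_text in scraped_rows:
    # '%B %Y' reads a full English month name followed by a four-digit year
    month_start = datetime.strptime(month_label, '%B %Y')
    typed_rows.append((game_id, month_start, float(count_text)))

# With real types, rows sort chronologically and counts support arithmetic
typed_rows.sort(key=lambda row: row[1])
change = typed_rows[-1][2] - typed_rows[0][2]
print(typed_rows[0][1].date(), typed_rows[-1][1].date(), round(change, 2))
```

In pandas the equivalent conversions would typically be done with `pd.to_datetime` and `pd.to_numeric` on the `date` and `player_count` columns.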


In [21]:
#Need to remove extraneous month (this drops only the last row of the DataFrame)
player_counts = player_counts.drop(player_counts.index[-1])

In [22]:
#Add new column so have clear temporal order
# Reverse the order of the entire DataFrame
player_counts = player_counts.iloc[::-1].reset_index(drop=True)

# Create a list with values starting from 1 and incrementing by 1, resetting for each game ID
times = []
current_game_id = None
current_time = 1

for _, row in player_counts.iterrows():
    if current_game_id != row['game_id']:
        current_time = 1
        current_game_id = row['game_id']
    times.append(current_time)
    current_time += 1

# Add a new column "time" to the player_counts DataFrame
player_counts["time"] = times

#change order back
player_counts = player_counts.iloc[::-1].reset_index(drop=True)

#check order
print(player_counts.head(527))

    game_id            date player_count  time
0    218620   December 2023     26041.68   137
1    218620   November 2023     26465.88   136
2    218620    October 2023     25632.99   135
3    218620  September 2023     28358.07   134
4    218620     August 2023     29461.89   133
..      ...             ...          ...   ...
522  200510    January 2013      4891.62     4
523  200510   December 2012      5211.39     3
524  200510   November 2012      7653.91     2
525  200510    October 2012     20353.15     1
526  204300   December 2023        23.38   137

[527 rows x 4 columns]


**This information can be used for various purposes. It can be used for unsupervised classification (to see which time points seem to contribute most to player counts) or a more guided analysis of how certain time points (free promotion periods) contribute to player counts.** *Note:* game_id refers to games which had a free promotion at some point, based on their steam id number.


# Spidering

What if we want to get a bunch of different pages from Wikipedia? We would
need to get the url for each of the pages we want. Typically, we want pages that
are linked to by other pages, and so we will need to parse pages and identify the
links. Right now we will be retrieving all links in the body of the content
analysis page.

To do this we will need to find all the `<a>` (anchor) tags with `href`s
(hyperlink references) inside of `<p>` tags. `href` can have many
[different](http://stackoverflow.com/questions/4855168/what-is-href-and-why-is-it-used)
[forms](https://en.wikipedia.org/wiki/Hyperlink#Hyperlinks_in_HTML), so
dealing with them can be tricky, but generally, you will want to extract
absolute or relative links. An absolute link is one you can follow without
modification, while a relative link requires a base url to which it is
appended. Wikipedia uses relative urls for its internal links: below is an example
for dealing with them.
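As a quick standalone check of how relative and absolute links behave, `urllib.parse.urljoin` can be tried on two hand-written `href` values (both chosen for illustration):

```python
import urllib.parse

base_url = 'https://en.wikipedia.org'

# A relative link is resolved against the base url
relative = urllib.parse.urljoin(base_url, '/wiki/Document')

# An absolute link already has a scheme and host, so it comes back unchanged
absolute = urllib.parse.urljoin(base_url, 'https://www.crummy.com/software/BeautifulSoup/')

print(relative)
print(absolute)
```

Because `urljoin` handles both cases, it is usually safer than gluing strings together with `+`.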

In [23]:
#wikipedia_base_url = 'https://en.wikipedia.org'

otherPAgeURLS = []
#We also want to know where the links come from so we also will get:
#the paragraph number
#the word the link is in
for paragraphNum, pTag in enumerate(contentPTags):
    #we only want hrefs that link to wiki pages
    tagLinks = pTag.findAll('a', href=re.compile('/wiki/'), class_=False)
    for aTag in tagLinks:
        #We need to extract the url from the <a> tag
        relurl = aTag.get('href')
        linkText = aTag.text
        #wikipedia_base_url is the base; we can use the urllib joining function to merge them
        #Giving a nice structured tuple like this means we can use tuple expansion later
        otherPAgeURLS.append((
            urllib.parse.urljoin(wikipedia_base_url, relurl),
            paragraphNum,
            linkText,
        ))
print(otherPAgeURLS[:10])

[('https://en.wikipedia.org/wiki/Document', 0, 'documents'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Coding_(social_sciences)', 1, 'assigned labels (sometimes called codes)'), ('https://en.wikipedia.org/wiki/Semantics', 1, 'meaningful'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Quantitative_research', 1, 'quantitatively'), ('https://en.wikipedia.org/wiki/Statistics', 1, 'statistical methods'), ('https://en.wikipedia.org/wiki/Qualitative_research', 1, 'qualitative'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Machine_learning', 2, 'Machine learning')]


In [24]:
print(contentPTags)

[<p><b>Content analysis</b> is the study of <a href="/wiki/Document" title="Document">documents</a> and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> One of the key advantages of using content analysis to analyse social phenomena is their non-invasive nature, in contrast to simulating social experiences or collecting survey answers.
</p>, <p>Practices and philosophies of content analysis vary between academic disciplines. They all involve systematic reading or observation of <a href="/wiki/Text_(literary_theory)" title="Text (literary theory)">texts</a> or artifacts which are <a href="/wiki/Coding_(social_sciences)" title="Coding (social sciences)">assigned labels (sometimes called codes)</a> to indicate the presence of interesting, <a href="/wiki/Semant

Another excursion: why do we use `enumerate()` here? `enumerate()` takes a collection and returns an enumerate object that pairs each item with its position, counting from 0. For example, `contentPTags` (the collection we used here) is made up of paragraphs, and we want the number of each paragraph. This is exactly what `enumerate()` gives us: the paragraph number together with the paragraph.
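A tiny example of `enumerate()` on a list of placeholder strings:

```python
paragraphs = ['Content analysis is...', 'Practices and philosophies...', 'Computers are...']

numbered = []
for paragraph_num, paragraph in enumerate(paragraphs):
    # Each step yields a (position, item) pair, starting from 0
    numbered.append((paragraph_num, paragraph))
print(numbered)

# The optional start= argument changes the first number
first_from_one = list(enumerate(paragraphs, start=1))[0]
print(first_from_one)
```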

We will be adding these new texts to our DataFrame `contentParagraphsDF` so we
will need to add 2 more columns to keep track of paragraph numbers and sources.

In [25]:
contentParagraphsDF['source'] = [wikipedia_content_analysis] * len(contentParagraphsDF['paragraph-text'])
contentParagraphsDF['paragraph-number'] = range(len(contentParagraphsDF['paragraph-text']))

contentParagraphsDF

Unnamed: 0,paragraph-text,source,paragraph-number
0,Content analysis is the study of documents and...,https://en.wikipedia.org/wiki/Content_analysis,0
1,Practices and philosophies of content analysis...,https://en.wikipedia.org/wiki/Content_analysis,1
2,Computers are increasingly used in content ana...,https://en.wikipedia.org/wiki/Content_analysis,2
3,Content analysis is best understood as a broad...,https://en.wikipedia.org/wiki/Content_analysis,3
4,The simplest and most objective form of conten...,https://en.wikipedia.org/wiki/Content_analysis,4
5,A further step in analysis is the distinction ...,https://en.wikipedia.org/wiki/Content_analysis,5
6,Quantitative content analysis highlights frequ...,https://en.wikipedia.org/wiki/Content_analysis,6
7,Siegfried Kracauer provides a critique of quan...,https://en.wikipedia.org/wiki/Content_analysis,7
8,The data collection instrument used in content...,https://en.wikipedia.org/wiki/Content_analysis,8
9,According to current standards of good scienti...,https://en.wikipedia.org/wiki/Content_analysis,9


Then we can add two more columns to our `DataFrame` and define a function to
parse each linked page and add its text to our DataFrame.

In [26]:
contentParagraphsDF['source-paragraph-number'] = [None] * len(contentParagraphsDF['paragraph-text'])
contentParagraphsDF['source-paragraph-text'] = [None] * len(contentParagraphsDF['paragraph-text'])

def getTextFromWikiPage(targetURL, sourceParNum, sourceText):
    #Make a dict to store data before adding it to the DataFrame
    parsDict = {'source' : [], 'paragraph-number' : [], 'paragraph-text' : [], 'source-paragraph-number' : [],  'source-paragraph-text' : []}
    #Now we get the page
    r = requests.get(targetURL)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    #enumerating gives us the paragraph number
    for parNum, pTag in enumerate(soup.body.findAll('p')):
        #same regex as before
        parsDict['paragraph-text'].append(re.sub(r'\[\d+\]', '', pTag.text))
        parsDict['paragraph-number'].append(parNum)
        parsDict['source'].append(targetURL)
        parsDict['source-paragraph-number'].append(sourceParNum)
        parsDict['source-paragraph-text'].append(sourceText)
    return pd.DataFrame(parsDict)

And run it on our list of link tags

In [27]:
for urlTuple in otherPAgeURLS[:3]:
    #DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
    #ignore_index=True renumbers the rows so the combined index runs 0, 1, 2, ...
    contentParagraphsDF = pd.concat([contentParagraphsDF, getTextFromWikiPage(*urlTuple)], ignore_index=True)
contentParagraphsDF

Unnamed: 0,paragraph-text,source,paragraph-number,source-paragraph-number,source-paragraph-text
0,Content analysis is the study of documents and...,https://en.wikipedia.org/wiki/Content_analysis,0,,
1,Practices and philosophies of content analysis...,https://en.wikipedia.org/wiki/Content_analysis,1,,
2,Computers are increasingly used in content ana...,https://en.wikipedia.org/wiki/Content_analysis,2,,
3,Content analysis is best understood as a broad...,https://en.wikipedia.org/wiki/Content_analysis,3,,
4,The simplest and most objective form of conten...,https://en.wikipedia.org/wiki/Content_analysis,4,,
...,...,...,...,...,...
58,Much of qualitative coding can be attributed t...,https://en.wikipedia.org/wiki/Coding_(social_s...,8,1,assigned labels (sometimes called codes)
59,Coding is considered a process of discovery an...,https://en.wikipedia.org/wiki/Coding_(social_s...,9,1,assigned labels (sometimes called codes)
60,"The process can be done manually, which can be...",https://en.wikipedia.org/wiki/Coding_(social_s...,10,1,assigned labels (sometimes called codes)
61,After assembling codes it is time to organize ...,https://en.wikipedia.org/wiki/Coding_(social_s...,11,1,assigned labels (sometimes called codes)



# <font color="red">Exercise 2</font>
<font color="red">Construct cells immediately below this that spider webcontent from another site with content relating to your anticipated final project. Specifically, identify urls on a core page, then follow and extract content from them into a pandas `Dataframe`. In addition, demonstrate a *recursive* spider, which follows more than one level of links (i.e., follows links from a site, then follows links on followed sites to new sites, etc.), making sure to define a reasonable endpoint so that you do not wander the web forever :-).</font>



***NOTE:* The following set of code leading up to the first table is for my convenience. It does use spidering to get the post dates, but this is an API. The following set of code, to be marked, is where all the spidering is.**

In [28]:
!pip install praw
!pip install asyncpraw
!pip install nest_asyncio
import praw
import asyncio
import nest_asyncio
import asyncpraw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl (191 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.4.0 update-checker-0.18.0
Collecting asyncpraw
  Downloading asyncpraw-7.7.1-py3-none-any.whl (196 kB)
Collecting aiofiles<1 (from asyncpraw)
  Downloading aiofiles-0.8.0-py3-none-any.whl (13 kB)
Collecting aiosqlite<=0.17.0 (from asyncpraw)
  Downloading aiosqlite-0.17.0-py3-none-any.whl (15 kB)
Collecting asyncprawcore<3,>=2.1 (from asyncpraw)
  Downloading asyncpra

In [29]:
#needs asynchronous; otherwise, doesn't like my entry information for some reason
nest_asyncio.apply()

#placeholder
post_titles = None

#command to get posts from subreddit
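async def fetch_posts():
    # Read the credentials from environment variables instead of hard-coding
    # secrets in the notebook (these variable names are placeholders; set your own)
    import os
    username = os.environ["REDDIT_USERNAME"]
    password = os.environ["REDDIT_PASSWORD"]
    client_secret = os.environ["REDDIT_CLIENT_SECRET"]
    client_id = os.environ["REDDIT_CLIENT_ID"]
    user_agent = f"u/{username}"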

    # Create an asynchronous Reddit instance with OAuth2 authentication
    reddit_async = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        username=username,
        password=password,
        user_agent=user_agent
    )

    # Specify the subreddit you want to search
    subreddit_name = 'GameDeals'
    subreddit_async = await reddit_async.subreddit(subreddit_name)

    # Specify the search term
    search_term = 'free weekend'

    # Fetch posts containing the search term
    posts_data = []
    async for submission in subreddit_async.search(search_term, limit=100):
        post_title = submission.title
        post_date = pd.to_datetime(submission.created_utc, unit='s')  # Convert epoch time to datetime
        posts_data.append({'Post Title': post_title, 'Post Date': post_date})

    # Create a DataFrame
    post_df = pd.DataFrame(posts_data)

    # Close the Reddit client session
    await reddit_async.close()

    return post_df

# Run the asynchronous code using asyncio.run()
post_titles = asyncio.run(fetch_posts())

In [30]:
#cleaning up the data so it excludes extraneous information and keeps only Steam promotions
post_titles = pd.DataFrame(post_titles)
#.copy() avoids pandas' SettingWithCopyWarning when we add columns below
post_titles = post_titles[post_titles['Post Title'].str.contains('[sS]team')].copy()

#regex=True is needed in pandas 2.0+, where str.replace treats patterns literally by default
post_titles['game'] = post_titles['Post Title'].str.replace(r'\s*\([^)]*\).*', '', regex=True)
post_titles['game'] = post_titles['game'].str.replace(r'\s*\$.*', '', regex=True)
post_titles['game'] = post_titles['game'].str.replace(r'\[steam\] ', '', case=False, regex=True)
post_titles['game'] = post_titles['game'].str.replace(r'Weekend Deal: ', '', case=False, regex=True)

post_titles.drop(columns=["Post Title"], inplace=True)
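
To see what the four substitutions above do to a single title, here is a rough sketch that applies the same patterns with the standard `re` module. The function name `clean_title` and the two sample titles are made up for illustration; they are not taken from the subreddit data.

```python
import re

def clean_title(title):
    # Drop the first parenthesised note and everything after it
    title = re.sub(r'\s*\([^)]*\).*', '', title)
    # Drop a price and everything after it
    title = re.sub(r'\s*\$.*', '', title)
    # Drop the "[Steam] " tag and the "Weekend Deal: " prefix, ignoring case
    title = re.sub(r'\[steam\] ', '', title, flags=re.IGNORECASE)
    title = re.sub(r'Weekend Deal: ', '', title, flags=re.IGNORECASE)
    return title

# Made-up titles, for illustration only
cleaned = [clean_title(t) for t in [
    '[Steam] Rocket League (Free Weekend) $9.99',
    'Weekend Deal: Dead by Daylight $7.99',
]]
print(cleaned)
```

Testing the patterns on a handful of strings like this before running them over a whole column makes it easier to spot a pattern that removes too much.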

In [31]:
# Convert "Post Date" to datetime format
post_titles['Post Date'] = pd.to_datetime(post_titles['Post Date'])

# Create new columns "month_promo" and "year_promo"
post_titles['month_promo'] = post_titles['Post Date'].dt.month
post_titles['year_promo'] = post_titles['Post Date'].dt.year

post_titles.head(5)

Unnamed: 0,Post Date,game,month_promo,year_promo
0,2018-07-05 17:11:59,Rocket League®,7,2018
2,2021-10-15 11:27:19,Bordelands 3,10,2021
3,2019-01-17 18:07:47,Dead by Daylight,1,2019
6,2023-02-17 07:36:42,Age of Empires IV,2,2023
7,2018-04-05 17:11:56,Crusader Kings II,4,2018


**Below is a separate analysis, with spidering (including recursive spidering).**

In [32]:
# URL of the top games wikipedia page
base_page = 'https://en.wikipedia.org/wiki/List_of_best-selling_PC_games'

# Fetch the HTML content
response = requests.get(base_page)
html_content = response.content

# Create a BeautifulSoup object
soup = bs4.BeautifulSoup(html_content, 'html.parser')

# Find the second table on the page
second_table = soup.find_all('table')[1]

# Find all <tr> tags within the <tbody> of the second table
tr_tags = second_table.find('tbody').find_all('tr')

# Extract the URLs from the first <a> tag within each <td>
url_list = [tr.find('td').find('a')['href'] for tr in tr_tags if tr.find('td') and tr.find('td').find('a')]

In [33]:
# Initialize a dictionary to store results (URLs for "Mode(s)")
result_dict = {}

# Check for "Mode(s)" in the HTML for each URL in url_list
for url in url_list:
    try:
        # Fetch the HTML content for the current URL
        response = requests.get(urllib.parse.urljoin(base_page, url))
        response.raise_for_status()  # Raise an error for bad responses (4xx and 5xx)
        html_content = response.content
        soup = bs4.BeautifulSoup(html_content, 'html.parser')

        # Find the first table on the page
        first_table = soup.find('table')

        if first_table:
            # Find the last <tr> in the table
            last_tr = first_table.find_all('tr')[-1]

            # Find the <td class="infobox-data"> in the last <tr>
            infobox_data_td = last_tr.find('td', class_='infobox-data')

            if infobox_data_td:
                # Check if there is an <a> tag within the infobox_data_td
                if infobox_data_td.find('a') is not None:
                    # Extract the URL contained within the <a> tag
                    url_result = infobox_data_td.find('a')['href']

                    # Store the result in the dictionary
                    result_dict[url] = url_result

    except requests.RequestException as e:
        # Handle request errors
        print(f"Error fetching {url}: {e}")

# Print the results
for key, value in result_dict.items():
    print(f"For URL: {key}, Mode(s) URL: {value}")

For URL: /wiki/Minecraft, Mode(s) URL: /wiki/Single-player
For URL: /wiki/Terraria, Mode(s) URL: /wiki/Single-player
For URL: /wiki/Diablo_III, Mode(s) URL: /wiki/Single-player
For URL: /wiki/Garry%27s_Mod, Mode(s) URL: /wiki/Single-player
For URL: /wiki/Rust_(video_game), Mode(s) URL: /wiki/Multiplayer_video_game
For URL: /wiki/World_of_Warcraft, Mode(s) URL: /wiki/Multiplayer_video_game
For URL: /wiki/Stardew_Valley, Mode(s) URL: /wiki/Single-player
For URL: /wiki/Half-Life_2, Mode(s) URL: /wiki/Single-player
For URL: /wiki/The_Witcher_3:_Wild_Hunt, Mode(s) URL: /wiki/Single-player
For URL: /wiki/The_Sims_(video_game), Mode(s) URL: /wiki/Single-player_video_game
For URL: /wiki/StarCraft_(video_game), Mode(s) URL: /wiki/Single-player
For URL: /wiki/RollerCoaster_Tycoon_3, Mode(s) URL: /wiki/Single-player_video_game
For URL: /wiki/Fall_Guys, Mode(s) URL: /wiki/Multiplayer_video_game
For URL: /wiki/Civilization_V, Mode(s) URL: /wiki/Single-player_video_game
For URL: /wiki/Cyberpunk_2077

In [34]:
# Initialize lists to store results
for_url_list = []
mode_url_list = []

# Check for "Mode(s)" in the HTML for each URL in url_list
for url in url_list:
    try:
        # Fetch the HTML content for the current URL
        response = requests.get(urllib.parse.urljoin(base_page, url))
        response.raise_for_status()  # Raise an error for bad responses (4xx and 5xx)
        html_content = response.content
        soup = bs4.BeautifulSoup(html_content, 'html.parser')

        # Find the first table on the page
        first_table = soup.find('table')

        if first_table:
            # Find the last <tr> in the table
            last_tr = first_table.find_all('tr')[-1]

            # Find the <td class="infobox-data"> in the last <tr>
            infobox_data_td = last_tr.find('td', class_='infobox-data')

            if infobox_data_td:
                # Check if there is an <a> tag within the infobox_data_td
                if infobox_data_td.find('a') is not None:
                    # Extract the URL contained within the <a> tag
                    mode_url = infobox_data_td.find('a')['href']
                else:
                    mode_url = "NA"  # If no Mode(s) URL, set to "NA"
            else:
                mode_url = "NA"  # If no infobox_data_td, set to "NA"
        else:
            mode_url = "NA"  # If no first_table, set to "NA"

        # Store the results in the lists
        for_url_list.append(url)
        mode_url_list.append(mode_url)

    except requests.RequestException as e:
        # Handle request errors
        print(f"Error fetching {url}: {e}")
        for_url_list.append(url)
        mode_url_list.append("NA")  # Set to "NA" for error cases

# Create a DataFrame from the lists and name it "url_table"
url_table = pd.DataFrame({'For URL': for_url_list, 'Mode(s) URL': mode_url_list})

In [35]:
#cleaning the dataset by changing column names, fixing links, etc.
#new "game" column
url_table['game'] = url_table['For URL'].str.replace('/wiki/', '')

# Rename the "For URL" column to "game_wikipage"
url_table.rename(columns={'For URL': 'game_wikipage'}, inplace=True)

#new "mode" column
url_table['mode'] = url_table['Mode(s) URL'].str.replace('/wiki/', '')

# Rename the "Mode(s) URL" column to "mode_wikipage"
url_table.rename(columns={'Mode(s) URL': 'mode_wikipage'}, inplace=True)

#fix links: the scraped hrefs are relative paths that already start with /wiki/
url_table['game_wikipage'] = 'https://en.wikipedia.org' + url_table['game_wikipage']
url_table.loc[url_table['mode'] != 'NA', 'mode_wikipage'] = 'https://en.wikipedia.org' + url_table.loc[url_table['mode'] != 'NA', 'mode_wikipage']

# Reorder columns and print the DataFrame
url_table = url_table[['game', 'game_wikipage', 'mode', 'mode_wikipage']]
url_table

Unnamed: 0,game,game_wikipage,mode,mode_wikipage
0,PUBG:_Battlegrounds,https://en.wikipedia.org/wiki/PUBG:_Battlegro...,,
1,Minecraft,https://en.wikipedia.org/wiki/Minecraft,Single-player,https://en.wikipedia.org/wiki/Single-player
2,Terraria,https://en.wikipedia.org/wiki/Terraria,Single-player,https://en.wikipedia.org/wiki/Single-player
3,Diablo_III,https://en.wikipedia.org/wiki/Diablo_III,Single-player,https://en.wikipedia.org/wiki/Single-player
4,Garry%27s_Mod,https://en.wikipedia.org/wiki/Garry%27s_Mod,Single-player,https://en.wikipedia.org/wiki/Single-player
...,...,...,...,...
176,Hearts_of_Iron_IV,https://en.wikipedia.org/wiki/Hearts_of_Iron_IV,Single-player_video_game,https://en.wikipedia.org/wiki/Single-player_video_game
177,Hollow_Knight,https://en.wikipedia.org/wiki/Hollow_Knight,Single-player,https://en.wikipedia.org/wiki/Single-player
178,Divinity:_Original_Sin_II,https://en.wikipedia.org/wiki/Divinity:_Origi...,Single-player,https://en.wikipedia.org/wiki/Single-player
179,Cuphead,https://en.wikipedia.org/wiki/Cuphead,Single-player,https://en.wikipedia.org/wiki/Single-player


**"Spidering" was done to get the mode URL from the game URLs. Less fancy spidering is done below (recursively), if that's what's wanted. In the code below, it just finds random URLs using "random," and prints them.**

In [36]:
import random

# Function to get the URLs linked from the first table on a page
def get_urls_from_page(url):
    response = requests.get(url)
    html_content = response.content
    soup = bs4.BeautifulSoup(html_content, 'html.parser')
    # Only the first table on the page is searched; pages without a table yield no URLs
    first_table = soup.find('table')
    if first_table is None:
        return []
    # Fall back to the table itself if the markup has no explicit <tbody>
    tr_tags = (first_table.find('tbody') or first_table).find_all('tr')

    urls = []
    for tr in tr_tags:
        td = tr.find('td')
        if td and td.find('a') and 'href' in td.find('a').attrs:
            urls.append(td.find('a')['href'])

    return urls

# URL of the top selling games wikipedia page
base_page = 'https://en.wikipedia.org/wiki/List_of_best-selling_PC_games'

# Get initial URLs from the base page
url_list = get_urls_from_page(base_page)

# Follow random URLs five times
for _ in range(5):
    # Check if url_list is not empty
    if url_list:
        # Choose a random URL from the list
        random_url = random.choice(url_list)

        # Construct the full URL
        full_url = f'https://en.wikipedia.org{random_url}'

        # Get URLs from the new page
        url_list = get_urls_from_page(full_url)

        # Print the current random URL
        print(full_url)
    else:
        print("No more URLs to follow.")
        break

https://en.wikipedia.org/wiki/Video_game_development
https://en.wikipedia.org/wiki/Video_game
https://en.wikipedia.org#Platform
https://en.wikipedia.org/wiki/File:Pinot_Grigio-20201027-RM-114053.jpg
No more URLs to follow.
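
The random walk above follows one link per step. A spider that follows every link for several levels needs two safeguards: a depth limit and a record of pages already seen. Below is a minimal sketch of a breadth-first, depth-limited crawl (an iterative alternative to a recursive function). The names `crawl`, `LinkParser` and `FAKE_SITE` are hypothetical, and the pages are served from an in-memory dict so the sketch runs without touching the network. With `requests`, one could pass something like `lambda u: requests.get(u).text` as `fetch`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def crawl(start_url, fetch, max_depth=2, max_pages=50):
    seen = {start_url}                 # never queue the same page twice
    queue = deque([(start_url, 0)])    # (url, depth) pairs still to visit
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:         # the endpoint: do not follow links further
            continue
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop #fragments so that
            # /wiki/Page and /wiki/Page#Section count as one page
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited

# A made-up four-page site standing in for real web pages
FAKE_SITE = {
    'https://example.org/a': '<a href="/b">b</a> <a href="#top">top</a>',
    'https://example.org/b': '<a href="/c">c</a> <a href="/a">a</a>',
    'https://example.org/c': '<a href="/d">d</a>',
    'https://example.org/d': '',
}

pages = crawl('https://example.org/a', FAKE_SITE.__getitem__, max_depth=2)
print(pages)
```

Resolving links with `urljoin` and stripping fragments is also what keeps addresses like `https://en.wikipedia.org#Platform` out of the results.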


## API (Tumblr)

Generally, website owners do not like you scraping their sites. If done badly,
scraping can act like a denial-of-service (DoS) attack, so you should be careful
how often you make calls to a site. Some sites do want automated tools to access
their data, so they create [application programming interfaces
(APIs)](https://en.wikipedia.org/wiki/Application_programming_interface). An API
specifies a procedure for an application (or script) to access their data. Often
this is through a [representational state transfer
(REST)](https://en.wikipedia.org/wiki/Representational_state_transfer) web
service, which just means that if you make correctly formatted HTTP requests,
they will return nicely formatted data.
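
One simple way to be careful about how often we call a site is to pause between requests. The sketch below is only an illustration: `fetch_politely` is a hypothetical helper, the one-second default is an arbitrary choice rather than a documented limit of any site, and `fetch` can be any function that takes a URL and returns its content.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results

# Demonstration with a stand-in fetch function and a recorded "sleep",
# so nothing is downloaded and nothing actually waits
pauses = []
demo = fetch_politely(['a', 'b', 'c'], str.upper, delay_seconds=0.5, sleep=pauses.append)
print(demo, pauses)
```

With `requests`, `fetch` could be something like `lambda u: requests.get(u).text`, with the default `sleep` left in place.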

A nice example for us to study is [Tumblr](https://www.tumblr.com), which has a
[simple RESTful API](https://www.tumblr.com/docs/en/api/v1) that allows you to
read posts without any complicated HTML parsing.

We can get the first 20 posts from a blog by making an HTTP GET request to
`'http://{blog}.tumblr.com/api/read/json'`, where `{blog}` is the name of the
target blog. Let's try to get the posts from
[http://lolcats-lol-cat.tumblr.com/](http://lolcats-lol-cat.tumblr.com/) (note
that the blog says 'One hour one pic lolcats' at the top, but the canonical name
that Tumblr uses is the one in the URL, 'lolcats-lol-cat').

In [37]:
tumblrAPItarget = 'http://{}.tumblr.com/api/read/json'

r = requests.get(tumblrAPItarget.format('lolcats-lol-cat'))

print(r.text[:1000])

var tumblr_api_read = {"tumblelog":{"title":"One hour one pic lolcats","description":"","name":"lolcats-lol-cat","timezone":"Europe\/Paris","cname":false,"feeds":[],"uuid":"t:nXFqyQsaizVnIxVAm-ttlA"},"posts-start":0,"posts-total":3926,"posts-type":false,"posts":[{"id":"679413944568430592","url":"https:\/\/lolcats-lol-cat.tumblr.com\/post\/679413944568430592","url-with-slug":"https:\/\/lolcats-lol-cat.tumblr.com\/post\/679413944568430592\/cat-cats-kitty-gato-saturday-meow-katze","type":"photo","date-gmt":"2022-03-22 09:00:29 GMT","date":"Tue, 22 Mar 2022 10:00:29","bookmarklet":0,"mobile":0,"feed-item":"","from-feed-id":0,"unix-timestamp":1647939629,"format":"html","reblog-key":"MDkw6o5C","slug":"cat-cats-kitty-gato-saturday-meow-katze","is-submission":false,"like-button":"<div class=\"like_button\" data-post-id=\"679413944568430592\" data-blog-name=\"lolcats-lol-cat\" id=\"like_button_679413944568430592\"><iframe id=\"like_iframe_679413944568430592\" src=\"https:\/\/assets.tumblr.com\/

This might not look very good on first inspection, but it has far fewer angle
brackets than HTML, which makes it easier to parse. What we have is
[JSON](https://en.wikipedia.org/wiki/JSON), a 'human readable' text-based data
transmission format based on JavaScript, wrapped here in a JavaScript variable
assignment. Luckily, once that wrapper is stripped, we can readily convert it to
a Python `dict`.

In [38]:
#We need to load only the stuff between the curly braces
d = json.loads(r.text[len('var tumblr_api_read = '):-2])
print(d.keys())
print(len(d['posts']))

dict_keys(['tumblelog', 'posts-start', 'posts-total', 'posts-type', 'posts'])
20
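
The slice above relies on the response starting with exactly `var tumblr_api_read = ` and ending with two extra characters. A slightly more forgiving sketch is to keep whatever sits between the first `{` and the last `}`. The helper name `parse_tumblr_v1` and the sample string are made up for illustration, and the approach assumes the payload is a single JSON object.

```python
import json

def parse_tumblr_v1(text):
    # Keep only the JSON object: from the first '{' to the last '}'
    start = text.index('{')
    end = text.rindex('}')
    return json.loads(text[start:end + 1])

# A made-up response in the same shape as the real one
sample = 'var tumblr_api_read = {"posts-total": 2, "posts": [{"id": "1"}, {"id": "2"}]};\n'
parsed = parse_tumblr_v1(sample)
print(parsed.keys())
print(len(parsed['posts']))
```

If the text contains no braces at all, `index()` raises a `ValueError`, which is a clearer failure than a JSON error on a badly sliced string.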


If we read the [API specification](https://www.tumblr.com/docs/en/api/v1), we
will see there are a lot of things we can get if we add things to our GET
request. First we can retrieve posts by their id number. Let's first get post
`146020177084`.

In [39]:
r = requests.get(tumblrAPItarget.format('lolcats-lol-cat'), params = {'id' : 146020177084})
d = json.loads(r.text[len('var tumblr_api_read = '):-2])
d['posts'][0].keys()
d['posts'][0]['photo-url-1280']

with open('lolcat.gif', 'wb') as f:
    gifRequest = requests.get(d['posts'][0]['photo-url-1280'], stream = True)
    f.write(gifRequest.content)

<img src='lolcat.gif'>

Such beauty; such vigor (if you can't see it, you have to refresh the page). Now
we could retrieve the text from all posts as well as related metadata, like the
post date, caption or tags. We could also get links to all the images.

In [40]:
#Putting a max in case the blog has millions of images
#Posts are requested in batches of 50, so the given max is rounded down to a multiple of 50
def tumblrImageScrape(blogName, maxImages = 200):
    #Restating this here so the function isn't dependent on any external variables
    tumblrAPItarget = 'http://{}.tumblr.com/api/read/json'

    #There are a bunch of possible locations for the photo url
    possiblePhotoSuffixes = [1280, 500, 400, 250, 100]

    #These are the pieces of information we will be gathering,
    #at the end we will convert this to a DataFrame.
    #There are a few other datums we could gather like the captions
    #you can read the Tumblr documentation to learn how to get them
    #https://www.tumblr.com/docs/en/api/v1
    postsData = {
        'id' : [],
        'photo-url' : [],
        'date' : [],
        'tags' : [],
        'photo-type' : []
    }

    #Tumblr limits us to a max of 50 posts per request
    for requestNum in range(maxImages // 50):
        requestParams = {
            'start' : requestNum * 50,
            'num' : 50,
            'type' : 'photo'
        }
        r = requests.get(tumblrAPItarget.format(blogName), params = requestParams)
        requestDict = json.loads(r.text[len('var tumblr_api_read = '):-2])
        for postDict in requestDict['posts']:
            #We are dealing with uncleaned data, we can't trust it.
            #Specifically, not all posts are guaranteed to have the fields we want
            try:
                postsData['id'].append(postDict['id'])
                postsData['date'].append(postDict['date'])
                postsData['tags'].append(postDict['tags'])
            except KeyError as e:
                raise KeyError("Post {} from {} is missing: {}".format(postDict['id'], blogName, e))

            foundSuffix = False
            for suffix in possiblePhotoSuffixes:
                try:
                    photoURL = postDict['photo-url-{}'.format(suffix)]
                    postsData['photo-url'].append(photoURL)
                    postsData['photo-type'].append(photoURL.split('.')[-1])
                    foundSuffix = True
                    break
                except KeyError:
                    pass
            if not foundSuffix:
                #Make sure your error messages are useful
                #You will be one of the users
                raise KeyError("Post {} from {} is missing a photo url".format(postDict['id'], blogName))

    return pd.DataFrame(postsData)
tumblrImageScrape('lolcats-lol-cat', 50)

Unnamed: 0,id,photo-url,date,tags,photo-type
0,679413944568430592,https://64.media.tumblr.com/1fb7e346f39e428540...,"Tue, 22 Mar 2022 10:00:29","[gif, lolcat, lolcats, cat, funny, cats, kitty...",gif
1,662815854023655425,https://64.media.tumblr.com/021eac8fbcafbb00a5...,"Mon, 20 Sep 2021 06:00:56","[gif, lolcat, lolcats, cat, funny]",gif
2,662778109891952640,https://64.media.tumblr.com/8c0517adb8c71e4a3d...,"Sun, 19 Sep 2021 20:01:00","[cat, cats, lol, lolcat, lolcats]",png
3,662657302700146688,https://64.media.tumblr.com/061d27cda309d5c809...,"Sat, 18 Sep 2021 12:00:50","[cat, cats, lol, lolcat, lolcats]",jpg
4,662513901538246656,https://64.media.tumblr.com/80584a9d1ff4ddc4fc...,"Thu, 16 Sep 2021 22:01:32","[cat, cats, lol, lolcat, lolcats]",png
5,662257177983090688,https://64.media.tumblr.com/893b320cd2e8970a20...,"Tue, 14 Sep 2021 02:01:01","[cat, cats, lol, lolcat, lolcats]",jpg
6,662166591527698432,https://64.media.tumblr.com/c7f0a0a9184e480e15...,"Mon, 13 Sep 2021 02:01:11","[cat, cats, lol, lolcat, lolcats]",jpg
7,662113740899090432,https://64.media.tumblr.com/07f7be7f71917a6049...,"Sun, 12 Sep 2021 12:01:09","[cat, cats, lol, lolcat, lolcats]",jpg
8,661955166248026112,https://64.media.tumblr.com/205f030c48d31f8960...,"Fri, 10 Sep 2021 18:00:40","[cat, cats, lol, lolcat, lolcats]",jpg
9,661894830378614784,https://64.media.tumblr.com/c463bff883fec2045b...,"Fri, 10 Sep 2021 02:01:39","[cat, cats, lol, lolcat, lolcats]",png


Now we have the urls of a bunch of images and can run OCR on them to gather
compelling meme narratives, accompanied by cats.
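
The function above requests posts in fixed batches of 50, so a limit that is not a multiple of 50 gets rounded down. As a rough sketch of an alternative, the hypothetical helper `page_params` below computes the `start` and `num` values for each request so that the last batch is simply smaller. It only builds the parameter dicts and makes no requests.

```python
def page_params(max_items, page_size=50):
    """Return one {'start', 'num'} dict per request needed to cover max_items."""
    params = []
    for start in range(0, max_items, page_size):
        # The final page asks only for the items that are still missing
        params.append({'start': start, 'num': min(page_size, max_items - start)})
    return params

pages = page_params(120)
print(pages)
```

Each dict could then be merged into `requestParams` in place of the fixed `'num' : 50`.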

# Files

What if the text we want isn't on a webpage? There are many other sources of
text available, typically organized into *files*.

## Raw text (and encoding)

The most basic form of storing text is as a _raw text_ document. Source code
(`.py`, `.r`, etc.) is usually raw text, as are text files (`.txt`) and those
with many other extensions (e.g., `.csv`, `.dat`). Opening an unknown file with
a text editor is often a great way of learning what the file is.

We can create a text file in Python with the `open()` function:

In [41]:
#example_text_file = 'sometextfile.txt'
#stringToWrite = 'A line\nAnother line\nA line with a few unusual symbols \u2421 \u241B \u20A0 \u20A1 \u20A2 \u20A3 \u0D60\n'
stringToWrite = 'A line\nAnother line\nA line with a few unusual symbols ␡ ␛ ₠ ₡ ₢ ₣ ൠ\n'

with open(example_text_file, mode = 'w', encoding='utf-8') as f:
    f.write(stringToWrite)

Notice the `encoding='utf-8'` argument, which specifies how we map the bits from
the file to the glyphs (and whitespace characters like tab (`'\t'`) or newline
(`'\n'`)) on the screen. When dealing only with Latin letters, Arabic numerals
and the other symbols on American keyboards, you usually do not have to worry
about encodings, as most of the ones used today are backwards compatible with
[ASCII](https://en.wikipedia.org/wiki/ASCII), which gives the binary
representation of 128 characters.

Some of you, however, will want to use other characters (e.g., Chinese
characters). To solve this there is
[Unicode](https://en.wikipedia.org/wiki/Unicode), which assigns numbers (code
points) to symbols, e.g., 0041 is `'A'` and 03A3 is `'Σ'` (both numbers are
hexadecimal). Often non/beyond-ASCII characters are called Unicode characters.
Unicode has room for 1,114,112 code points, about 10\% of which have been
assigned. Unfortunately, there are many ways used to map combinations of bits to
Unicode symbols. The ones you are likely to encounter are called by Python
_utf-8_, _utf-16_ and _latin-1_. _utf-8_ is the standard for Linux and Mac OS,
while both _utf-16_ and _latin-1_ are used by Windows. If you use the wrong
encoding, characters can appear wrong or change in number, or Python could raise
an exception. Let's see what happens when we open the file we just created with
different encodings.

In [42]:
with open(example_text_file, encoding='utf-8') as f:
    print("This is with the correct encoding:")
    print(f.read())

with open(example_text_file, encoding='latin-1') as f:
    print("This is with the wrong encoding:")
    print(f.read())

This is with the correct encoding:
A line
Another line
A line with a few unusual symbols ␡ ␛ ₠ ₡ ₢ ₣ ൠ

This is with the wrong encoding:
A line
Another line
A line with a few unusual symbols â¡ â â  â¡ â¢ â£ àµ 



Notice that with _latin-1_ the Unicode characters are mixed up and there are too
many of them, because each multi-byte _utf-8_ character is read as several
single-byte _latin-1_ characters. You need to keep encoding in mind when
obtaining text files. Determining the encoding can sometimes involve substantial
work.
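
The same effect can be reproduced without any file, using `str.encode()` and `bytes.decode()`. This is a small sketch on a single character, with variable names chosen just for this example:

```python
sigma = '\u03a3'  # the Greek capital letter sigma

# Code points are just integers; hex() shows them the way Unicode tables do
print(hex(ord('A')), hex(ord(sigma)))

# utf-8 stores this one character as two bytes
utf8_bytes = sigma.encode('utf-8')

# latin-1 maps every single byte to a character, so decoding "succeeds"
# but produces two wrong characters instead of one right one
mojibake = utf8_bytes.decode('latin-1')
print(len(sigma), len(utf8_bytes), len(mojibake))

# Some mismatches fail loudly instead: utf-16 output starts with a
# byte-order mark whose bytes are not valid utf-8
try:
    sigma.encode('utf-16').decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

A silent mix-up like the _latin-1_ case is the more dangerous one, since nothing tells you the text is wrong until you look at it.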

We can also load many text files at once. Let's start by looking at the Shakespeare files in the `data` directory.

In [None]:
with open('./data/Shakespeare/midsummer_nights_dream.txt', encoding='utf-8') as f:
    midsummer = f.read()
print(midsummer[-700:])

By the way, depending on your working directory, you might get an error such as `[Errno 2] No such file or directory: '../data/Shakespeare/midsummer_nights_dream.txt'`. Don't panic; just check your working directory and adjust the path.

Then to load all the files in `./data/Shakespeare` we can use a for loop with `scandir`:

In [None]:
targetDir = './data/Shakespeare' #Change this to your own directory of texts
shakespearText = []
shakespearFileName = []

for file in (file for file in os.scandir(targetDir) if file.is_file() and not file.name.startswith('.')):
    with open(file.path, encoding="utf-8") as f:
        shakespearText.append(f.read())
    shakespearFileName.append(file.name)

Then we can put them all in a pandas DataFrame:

In [None]:
shakespear_df = pd.DataFrame({'text' : shakespearText}, index = shakespearFileName)
shakespear_df

Getting your text in a format like this is the first step of most analyses.

## PDF

Another common way text will be stored is in a PDF file. First we will download
a PDF in Python. To do that, let's grab a chapter from
_Speech and Language Processing_; chapter 21 is on Information Extraction, which
seems apt. It is stored as a PDF at
[https://web.stanford.edu/~jurafsky/slp3/21.pdf](https://web.stanford.edu/~jurafsky/slp3/21.pdf),
although we are downloading from a copy just in case Jurafsky changes their
website.

In [None]:
#information_extraction_pdf = 'https://github.com/KnowledgeLab/content_analysis/raw/data/21.pdf'

infoExtractionRequest = requests.get(information_extraction_pdf, stream=True)
print(infoExtractionRequest.text[:1000])

It says `'pdf'`, so that's a good sign. The rest, though, looks like we are
having issues with an encoding. The random characters are not caused by our
encoding being wrong, however. They are caused by there not being an encoding
for those parts at all. PDFs are nominally binary files, meaning there are
sections of binary that are specific to PDF and nothing else, so you need
something that knows about PDF to read them. To do that we will be using
`pdfminer`, a PDF processing library for Python 3.
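
Spotting `'pdf'` in the printed text works because PDF files begin with the marker `%PDF-`. A small sketch of checking this on the raw bytes, before handing them to a parser, is shown below. The function name `looks_like_pdf` and the two sample byte strings are made up for illustration. With the request above, the bytes would come from `infoExtractionRequest.content` rather than `.text`.

```python
def looks_like_pdf(data):
    # PDF files begin with the ASCII marker "%PDF-" followed by a version number
    return data.startswith(b'%PDF-')

# Hypothetical byte strings standing in for downloaded content
pdf_like = b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n'
html_like = b'<!DOCTYPE html><html></html>'
print(looks_like_pdf(pdf_like), looks_like_pdf(html_like))
```

A check like this catches the common case where a download link returns an HTML error page instead of the document.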


Because PDFs are a very complicated file format, pdfminer requires a large
amount of boilerplate code to extract text. We have written a function that
takes in an open PDF file and returns the text so you don't have to.

In [None]:
def readPDF(pdfFile):
    #Based on code from http://stackoverflow.com/a/20905381/4955164
    #Using utf-8, if there are a bunch of random symbols try changing this
    codec = 'utf-8'
    rsrcmgr = pdfminer.pdfinterp.PDFResourceManager()
    retstr = io.StringIO()
    layoutParams = pdfminer.layout.LAParams()
    device = pdfminer.converter.TextConverter(rsrcmgr, retstr, laparams = layoutParams, codec = codec)
    #We need a device and an interpreter
    interpreter = pdfminer.pdfinterp.PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos=set()
    for page in pdfminer.pdfpage.PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    returnedString = retstr.getvalue()
    retstr.close()
    return returnedString

First we need to take the response object and convert it into a 'file like'
object so that pdfminer can read it. To do this we will use `io`'s `BytesIO`.

In [None]:
infoExtractionBytes = io.BytesIO(infoExtractionRequest.content)
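
To see what 'file like' means here, the sketch below wraps a short made-up byte string in `io.BytesIO` and uses it the way one would use a file opened in binary mode:

```python
import io

# Hypothetical payload standing in for the bytes of a downloaded file
payload = b'first line\nsecond line\n'
fileLike = io.BytesIO(payload)

first = fileLike.readline()   # reads up to and including the first newline
position = fileLike.tell()    # the read position has advanced
fileLike.seek(0)              # rewind, exactly as with a file on disk
everything = fileLike.read()
print(first, position, everything == payload)
```

Because the object supports `read()`, `seek()` and `tell()`, libraries that expect an open binary file can usually work with it directly, without anything being written to disk.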

Now we can give it to pdfminer.

In [None]:
print(readPDF(infoExtractionBytes)[:550])

From here we can either look at the full text or fiddle with our PDF reader and
get more information about individual blocks of text.

## Word Docs

The other type of document you are likely to encounter is the `.docx`. These are
actually zipped collections of
[XML](https://en.wikipedia.org/wiki/Office_Open_XML) files, a markup language
related to HTML, and as with HTML we will use a specialized parser.

For this class we will use
[`python-docx`](https://python-docx.readthedocs.io/en/latest/), which provides a
nice simple interface for reading `.docx` files.

In [None]:
#example_docx = 'https://github.com/KnowledgeLab/content_analysis/raw/data/example_doc.docx'

r = requests.get(example_docx, stream=True)
d = docx.Document(io.BytesIO(r.content))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

This procedure uses the `io.BytesIO` class again, since `docx.Document` expects
a file. Another way to do it is to save the document to a file and then read it
like any other file. If we do this we can either delete the file afterwards, or
keep it and avoid downloading it the next time.

The function below is useful as a part of many different tasks, so it and others like it will be added to the helper package `lucem_illud` so we can use it later without having to retype it.

In [None]:
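def downloadIfNeeded(targetURL, outputFile, **openkwargs):
    #Download only if we do not already have a local copy
    if not os.path.isfile(outputFile):
        outputDir = os.path.dirname(outputFile)
        #os.makedirs() is a more general os.mkdir(): it also creates any
        #missing parent directories
        if len(outputDir) > 0:
            os.makedirs(outputDir, exist_ok = True)
        r = requests.get(targetURL, stream=True)
        #Stop here on an HTTP error rather than saving an error page
        r.raise_for_status()
        #Using a context manager (`with`) like this is generally better than
        #having to remember to close the file. There are ways to make this
        #function work as a context manager too
        with open(outputFile, 'wb') as f:
            f.write(r.content)
    #Any extra keyword arguments (e.g. mode = 'rb') are passed on to open()
    return open(outputFile, **openkwargs)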

This function will download `targetURL`, save it as `outputFile` and open it, or
just open `outputFile` if it already exists. By default `open()` will open the
file as read-only text with the local encoding, which may cause issues if it's
not a text file.
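
To see why text mode can be a problem for binary files, here is a small sketch. The payload is invented for illustration: it starts with the ZIP signature and continues with bytes that are not valid UTF-8. The text-mode read sets `encoding="utf-8"` explicitly so the behavior is the same on every machine; with the local default encoding the result may differ.

```python
import os
import tempfile

# Made-up binary payload: ZIP signature followed by bytes that are invalid UTF-8.
binary_payload = b"PK\x03\x04\xff\xfe\x00\x01"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "payload.bin")
    with open(path, "wb") as f:
        f.write(binary_payload)

    # Binary mode returns the bytes untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode tries to decode the bytes and fails on this payload.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as err:
        text_mode_error = err

print(type(as_bytes).__name__)
print(type(text_mode_error).__name__)
```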

In [None]:
try:
    d = docx.Document(downloadIfNeeded(example_docx, example_docx_save))
except Exception as e:
    print(e)

We need to tell `open()` to read in binary mode (`'rb'`). This is why we added
`**openkwargs`: it allows us to pass any keyword arguments (kwargs) from
`downloadIfNeeded` to `open()`.
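
The forwarding pattern on its own looks like this. `open_with_options` is a hypothetical helper written only for this sketch (it is not part of `lucem_illud`); it passes whatever keyword arguments it receives straight through to `open()`.

```python
import os
import tempfile

def open_with_options(path, **open_kwargs):
    """Forward any extra keyword arguments straight to open()."""
    return open(path, **open_kwargs)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "note.txt")

    # mode and encoding are not named in the helper's signature,
    # yet they still reach open() through **open_kwargs.
    with open_with_options(path, mode="w", encoding="utf-8") as f:
        f.write("héllo")

    # The same helper can read bytes...
    with open_with_options(path, mode="rb") as f:
        raw = f.read()

    # ...or decoded text, depending on the keyword arguments passed.
    with open_with_options(path, encoding="utf-8") as f:
        text = f.read()

print(raw)
print(text)
```

The helper never needs to know which options `open()` supports, so new options can be used without changing the helper.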

In [None]:
d = docx.Document(downloadIfNeeded(example_docx, example_docx_save, mode = 'rb'))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

Now we can read the file with `docx.Document` and not have to wait for it to be
downloaded every time.

# <font color="red">Exercise 3</font>
<font color="red">Construct cells immediately below this that extract and organize textual content from text, PDF or Word into a pandas dataframe.</font>


In [44]:
#need to access my txt file (https://drive.google.com/file/d/1mPe0kSSBgHeeBvkh7-Q2mm9VT4ekm003/view?usp=sharing)
from google.colab import drive
drive.mount('/content/drive')

# Set the working directory to the location of my file
%cd '/content/drive/My Drive'

reviews = 'steam_reviews.txt'

# Read the content of the file and split it by lines
with open(reviews, encoding='utf-8') as f:
    lines = f.read().splitlines()

# Create a DataFrame with a column named 'Review' containing the lines
review_df = pd.DataFrame({'Review': lines})

review_df.head(5)

Mounted at /content/drive
/content/drive/My Drive


Unnamed: 0,Review
0,"""funny,""\t""helpful,""\t""hour_played,""\t""recomme..."
1,"""2,""\t""4,""\t""578,""\t""Recommended,""\tExpansion ..."
2,"""126,""\t""1086,""\t""676,""\t""Recommended,""\tDead ..."
3,"""89,""\t""536,""\t""11,""\t""Recommended,""\tWargroove"
4,"""0,""\t""0,""\t""9,""\t""Recommended,""\tWallpaper En..."


In [45]:
#The file is already tab-separated, so pandas can parse it directly into organized columns
review_df = pd.read_csv(reviews, sep='\t')
review_df = review_df.replace(',', '', regex=True)
review_df.columns = [col.replace(',', '') for col in review_df.columns]

#view df
review_df

Unnamed: 0,funny,helpful,hour_played,recommendation,title
0,2,4,578,Recommended,Expansion - Hearts of Iron IV: Man the Guns
1,126,1086,676,Recommended,Dead by Daylight
2,89,536,11,Recommended,Wargroove
3,0,0,9,Recommended,Wallpaper Engine
4,579,456,128,Recommended,Factorio
5,0,0,79,Recommended,Insurgency: Sandstorm
6,3,408,17,Recommended,Cold Waters
7,30,342,11,Recommended,Tannenberg
8,4,202,267,Recommended,Pathfinder: Kingmaker
9,14,503,225,Recommended,MONSTER HUNTER: WORLD


# <font color="red">Exercise 4</font>

<font color="red">In the two cells immediately following, describe a possible project (e.g., it might end up being your final project, but need not be if you are still searching): **WHAT** you will analyze--the texts you will select and the social game, world and actors you intend to learn about through your analysis (<100 words); **WHY** you will analyze these texts to learn about that context--justify the rationale behind your proposed sample design for this project, based on the readings. What is the social game, social world, or social actors about whom you are seeking to make inferences? What are the virtues of your proposed sample with respect to your research questions? What are its limitations? What are alternatives? What would be a reasonable path to "scale up" your sample for further analysis (i.e., high-profile publication)? (<150 words). [**Note**: your individual or collective project will change over the course of the quarter as new data and/or analysis opportunities arise or if old ones fade away.]</font>

## ***What?***

I'll use the GameDeals subreddit from Reddit to get the dates of free price promotions and the games which had these promotions. I'll also use steamcharts.com to get the player counts at various points in time for these games. Lastly, I'll get the number of reviews off of store.steampowered.com (if possible) for the games with free price promotions. I intend to learn about consumers in the online game market (with the identification strategy being Steam users). Additionally, I might glean insights about differences across types of games in this market.

## ***Why?***

I intend to learn about how consumers in this market respond to free price promotions, potentially using cues from these promotions (e.g., increases in number of reviews/player counts) to make their choices. Steam is the largest online game retailer, so using its data makes sense both for generalization and for getting as much data as possible (so my assumptions can be weaker, in accordance with the readings/lecture). However, Epic Games is another online game retailer with more systematic and clear-cut free price promotions, making it a good alternative to Steam data if need be. In fact, forming a theory based on the Steam data and checking its robustness with the Epic Games market would help with scaling up. Fortunately, Steam data is well-reported, so I don't have to worry about bias there, but free price promotion reporting on Reddit could be biased toward popular games, so I'll have to be careful there.

### Sources:

Other popular sources for internet data:

[Reddit](https://www.reddit.com/) - https://praw.readthedocs.io/en/v2.1.21/

[Twitter](https://twitter.com/) - https://pypi.org/project/python-twitter/

[Project Gutenberg](https://www.gutenberg.org/) - https://github.com/ageitgey/Gutenberg

