<a href="https://colab.research.google.com/github/yirouleh/cs167notes/blob/main/Day24_f21_WebScrapingDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Let's Scrape some Fantasy Character Names

## Brando Sando Style:

Let's scrape the names of Brandon Sanderson characters from [this website](https://coppermind.net/wiki/Category:Characters).

Let's start by importing Beautiful Soup and providing the URL of the webpage. We use the `requests` library to fetch the page's HTML as text.

In [None]:
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://coppermind.net/wiki/Category:Characters').text

soup = BeautifulSoup(source, 'html.parser')

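# newline='' is what the csv module recommends; utf-8 keeps accented names intact
csv_file = open('fantasy_characters.csv', 'w', newline='', encoding='utf-8')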
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['name', 'author','genre', 'series', 'description']) 

print(soup.prettify())
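
Some of the text we scrape later contains commas, so it's worth seeing how `csv.writer` copes with them. Here is a minimal sketch that writes to an in-memory buffer instead of the real file; the row contents and the names `buffer` and `demo_writer` are made up for illustration.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for the CSV file
demo_writer = csv.writer(buffer)

demo_writer.writerow(['name', 'description'])
# The second field contains a comma, so the writer wraps it in quotes.
demo_writer.writerow(['Example Name', 'A ranger, later a king'])

csv_text = buffer.getvalue()
print(csv_text)
```

The quoted field reads back as a single value, so we don't have to escape commas ourselves.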

Since the names are stored in `li` elements, we will use the `soup.find_all` function with the argument `'li'`. This finds all of the list items on the page. It's a good first step, but notice that it's not quite what we want: it also includes the list items that come before the actual character names.

Also, notice `li.text`. We loop over each list item and look only at its text, which is what `.text` gives us.

In [None]:
for li in soup.find_all('li'):
  print(li.text)

Now that we can get the characters' names, we want to give the search the correct starting point. Here we'll use the `soup.find` function. After using the browser's 'Inspect' tool, we know we're looking for the `div` element with `id='mw-pages'`. If we save this as `section`, we can search only this part of the page, using the code from above, to collect a list of all of the names of Brandon Sanderson characters.

In [3]:
section = soup.find('div', id='mw-pages')

count = 0

for li in section.find_all('li'):
  count = count + 1
  name = li.text
  
  #print the first few lines just to make sure it looks good
  if count < 10: 
    print(count, name)

  #add some other information to the csv file.
  author = 'Brandon Sanderson'
  genre = 'fantasy'
  series = ''
  description =''
  csv_writer.writerow([name, author, genre, series, description])
  

1 Aarik
2 Aaron
3 Abaray
4 Abiajan
5 Abigail Casey
6 Abraham Desjardins
7 Abrial
8 Abrobadar
9 Abronai
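
To see why narrowing the search matters, here is a small sketch on a hand-written HTML snippet. The snippet is hypothetical and only mimics the page's layout, and the names `sample_soup` and `sample_section` are chosen so they don't overwrite the notebook's `soup` and `section`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: a navigation list plus the section we care about.
sample_html = """
<ul><li>Home</li><li>About</li></ul>
<div id="mw-pages">
  <ul><li>Aarik</li><li>Aaron</li></ul>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Searching the whole page picks up the navigation items too.
all_items = [li.text for li in sample_soup.find_all('li')]

# Searching inside the section returns only the names.
sample_section = sample_soup.find('div', id='mw-pages')
names_only = [li.text for li in sample_section.find_all('li')]

print(all_items)
print(names_only)
```

Calling `find_all` on the element returned by `find` limits the search to that element's descendants, which is the same trick the cell above relies on.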


## Now let's add some Wheel of Time:

New URL: https://en.wikipedia.org/wiki/List_of_Wheel_of_Time_characters 

You'll notice that this website is set up a bit differently from the Brando Sando website. We'll have to adjust our web-scraping code accordingly.


In [4]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_Wheel_of_Time_characters').text

soup = BeautifulSoup(source, 'html.parser')
section = soup.find('div', class_='mw-parser-output')

count = 0
for li in section.find_all('li'):

  #This one isn't quite as simple because each entry has a character name and a description.
  count = count + 1
  list_item = li.text

  #not all of the entries are names, but all the names follow the syntax <name: description>
  #split at the first ':' only, so a description that contains ':' is not cut short
  parts = list_item.split(':', 1)
  if len(parts) > 1:
    name = parts[0]  #the first element is the name, so we use [0]
    description = parts[1]  #the description is everything after the first ':', so we use [1]


    # SPOILER ALERT: DON'T PRINT DESCRIPTION OUT IF YOU DON'T WANT TO READ THE CHARACTER DESCRIPTIONS.
    if count < 10:
      print(count, name)

    #then we add some other information and write a row to the csv file
    series = 'Wheel of Time'
    author = 'Robert Jordan'
    genre= 'fantasy'
    csv_writer.writerow([name, author, genre, series, description])

1 Logain Ablar
2 Jonan Adley
3 .mw-parser-output .vanchor>
4 Lelaine Akashi
5 Nalesean Aldiaya
6 Algarin Pendaloan
7 Alivia
8 Katerine Alruddin
9 Doesine Alwain
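
The name/description split can also be written with `str.partition`, which always cuts at the first colon and tells us whether a colon was found at all. This is just a sketch: `split_entry` is a hypothetical helper, not part of the notebook, and the sample strings are made up.

```python
def split_entry(list_item):
    """Split '<name>: <description>' at the first colon only.

    Returns None when the text has no colon, so non-character items can be skipped.
    """
    name, sep, description = list_item.partition(':')
    if not sep:
        return None
    return name.strip(), description.strip()

# Made-up entries for illustration, not real scraped text.
print(split_entry('Some Name: A description: with a second colon'))
print(split_entry('Just a heading'))
```

Anything after the first colon stays in the description, and the `None` return makes it easy to skip list items that aren't characters.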


## How about some Kingkiller Chronicles Characters?
https://kingkiller.fandom.com/wiki/Category:Characters


In [9]:
source = requests.get('https://kingkiller.fandom.com/wiki/Category:Characters').text


# Now it's your turn... try to scrape the character names from the Kingkiller Chronicles 

soup = BeautifulSoup(source, 'html.parser')
section = soup.find('div', class_='category-page__members')

count = 0
for li in section.find_all('li'):

  #Here each list item holds only a name, so we strip the surrounding whitespace.
  count = count + 1
  name = li.text.strip()
  description = ''  #no descriptions on this page; reset so the last Wheel of Time description isn't reused

  #print the first few names just to make sure it looks good
  if count < 10:
    print(count, name)

  #then we add some other information and write a row to the csv file
  series = 'Kingkiller Chronicles'
  author = 'Patrick Rothfuss'
  genre = 'fantasy'
  csv_writer.writerow([name, author, genre, series, description])


1 Aaron
2 Abenthy
3 Aethe
4 Category:Alchemist
5 Alder Whin
6 Aleph
7 Alleg
8 Ambrose Jakis
9 Amlia
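
Notice that `Category:Alchemist` slipped into the output: it looks like a link to a subcategory rather than a character. One possible way to drop entries like that is a simple prefix check. This is a heuristic sketch, and `is_character_name` is a hypothetical helper, not something the notebook defines.

```python
def is_character_name(name):
    """Heuristic: treat entries starting with 'Category:' as subcategory links."""
    return not name.startswith('Category:')

# A few names copied from the output above.
sample_names = ['Aethe', 'Category:Alchemist', 'Alder Whin']
filtered_names = [n for n in sample_names if is_character_name(n)]
print(filtered_names)
```

In the loop above, the same check could guard the `csv_writer.writerow` call so only real names reach the file.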


## Lord of the Rings, Anyone?
Yesss, my preciousssss: https://en.wikipedia.org/wiki/Category:The_Lord_of_the_Rings_characters

In [10]:
source = requests.get('https://en.wikipedia.org/wiki/Category:The_Lord_of_the_Rings_characters').text

soup = BeautifulSoup(source, 'html.parser')
section = soup.find('div', class_='mw-category')

count = 0
for li in section.find_all('li'):
  #skip the first list item; only the items after it are printed and written out
  if count != 0:
    name= li.text
    print(name)
    author = 'J. R. R. Tolkien'
    genre = 'fantasy'
    series = 'Lord of the Rings'
    description =''
    csv_writer.writerow([name, author, genre, series, description])
  count= count + 1

Aragorn
Arwen
Bilbo Baggins
Frodo Baggins
Beechbone
Tom Bombadil
Boromir
Merry Brandybuck
Bregalad
Celeborn
Déagol
Denethor
Elendil
Elrond
Éomer
Éomund
Éothain
Éowyn
Faramir
Fellowship of the Ring (characters)
Galadriel
Samwise Gamgee
Gandalf
Gimli (Middle-earth)
Glorfindel
Goldberry
Gollum
Gríma Wormtongue
Isildur
Khamûl
Legolas
Nazgûl
Radagast
Saruman
Sauron
Shelob
Théoden
Pippin Took
Treebeard


# Annnnddd... let's give that bad boy a download

In [11]:
csv_file.close()
from google.colab import files
files.download('fantasy_characters.csv')
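
Before trusting the download, it can help to read the CSV back and count the rows per series. Below is a minimal sketch using `csv.DictReader`; `count_by_series` is a hypothetical helper, and the sample rows are a tiny stand-in for the real file, using the same column names as the header we wrote at the start.

```python
import csv
import io
from collections import Counter

def count_by_series(csv_stream):
    """Count rows per 'series' value in an open CSV stream with a header row."""
    reader = csv.DictReader(csv_stream)
    return Counter(row['series'] for row in reader)

# Tiny in-memory stand-in for fantasy_characters.csv.
sample = io.StringIO(
    "name,author,genre,series,description\n"
    "Aragorn,J. R. R. Tolkien,fantasy,Lord of the Rings,\n"
    "Arwen,J. R. R. Tolkien,fantasy,Lord of the Rings,\n"
    "Abenthy,Patrick Rothfuss,fantasy,Kingkiller Chronicles,\n"
)

counts = count_by_series(sample)
print(counts)
```

For the real file, the same function should work on `open('fantasy_characters.csv', newline='')` once `csv_file.close()` has run.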
