## Challenge 1: More data structures - Array

A NumPy array is similar to a list, but all of its elements share the same data type. Arrays may be preferred to lists when handling large amounts of data because they are computationally more efficient. Use the numpy library to create a "numpy array":

In [3]:
import numpy as np

In [4]:
# Define a numpy array
my_array = np.array([4, 5, 6, 10])

print(my_array)
print(type(my_array))

[ 4  5  6 10]
<class 'numpy.ndarray'>
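One way to see the "same type" idea in practice is to inspect an array's `dtype`. The following is a minimal sketch; the variable names `ints`, `mixed`, and `doubled` are made up for illustration and are not part of the challenge.

```python
import numpy as np

# An array built only from integers stores them with one integer dtype
ints = np.array([4, 5, 6, 10])
print(ints.dtype.kind)   # 'i' stands for a signed integer type

# Mixing integers and floats makes NumPy convert every element to float
mixed = np.array([4, 5.5, 6])
print(mixed.dtype)

# Arithmetic is applied to every element at once, with no explicit loop
doubled = ints * 2
print(doubled)
```

Because every element has the same type, NumPy can apply an operation such as `* 2` to the whole array in one step, which is a large part of why arrays are efficient for big datasets.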


As with lists, we can change an individual value by assigning a new value to its index:

In [5]:
my_array[0] = 42
print(my_array)

[42  5  6 10]


Define an array based on a range of numbers:

In [9]:
number_range = np.arange(14,29)
print(number_range)



[14 15 16 17 18 19 20 21 22 23 24 25 26 27 28]


What if I want just the third, fourth, and fifth elements of `number_range`?

In [None]:
number_range[2:5]


array([16, 17, 18])
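The slice above works because Python counts from zero and a slice excludes its stop index. Here is a short sketch of a few more slicing patterns; the names `middle`, `every_other`, and `last_two` are illustrative only.

```python
import numpy as np

number_range = np.arange(14, 29)   # 14 up to, but not including, 29

# start:stop includes index `start` and excludes index `stop`
middle = number_range[2:5]         # indices 2, 3 and 4
print(middle)

# A third value sets the step: every other element among the first six
every_other = number_range[0:6:2]
print(every_other)

# Negative indices count from the end of the array
last_two = number_range[-2:]
print(last_two)
```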

## Challenge 2: More data structures - Tuple

Tuples are similar to lists and arrays, but they are unchangeable ("immutable"). 

For example, we cannot change the output of 10 (the dividend) divided by 3 (the divisor): by definition, the quotient is 3 and the remainder is 1, and this cannot be changed.

> Although they appear similar to lists, note that tuples use round parentheses `()` to store information instead of square brackets.

In [10]:
div = divmod(10,3)
print(div)
print(type(div))

(3, 1)
<class 'tuple'>
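To see what "immutable" means in practice, the sketch below tries to overwrite an element of the tuple and catches the resulting error. The `changed` flag is a hypothetical helper used only to record what happened.

```python
div = divmod(10, 3)

# Assigning to an element fails because tuples are immutable
try:
    div[0] = 4
    changed = True
except TypeError as err:
    changed = False
    print("Cannot modify a tuple:", err)

# The tuple is untouched, and it can be unpacked into separate variables
quotient, remainder = div
print(quotient, remainder)
```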


Figure out another way to define any tuple below:

In [11]:
# Empty tuple
tuple_empty = ()
print(tuple_empty)
print(type(tuple_empty))

()
<class 'tuple'>


In [12]:
# Tuple with floats
tuple_float = (1.12, 2.23, 3.14)
print(tuple_float)
print(type(tuple_float))

(1.12, 2.23, 3.14)
<class 'tuple'>


In [13]:
# Complex tuple
tuple_complex = ([1, 3, 12], True, "hello!", (3.14, 3.14, 3.15))
print(tuple_complex)
print(type(tuple_complex))

([1, 3, 12], True, 'hello!', (3.14, 3.14, 3.15))
<class 'tuple'>


How do you think you index or slice a tuple? HINT: You are already familiar with this! Provide one example below.

In [14]:
print(tuple_complex[1])

# or

print(tuple_complex[3][2])

True
3.15


## Challenge 3: More data structures - Series

Series are the pandas version of arrays: one-dimensional data storage, preferably of the same type.

In [None]:
pd.Series?

What must you do before you can call the help files for `pd.Series`?

In [16]:
import pandas as pd

In [None]:
pd.Series?

What are the two arguments in the `pd.Series` code below?

In [None]:
pd_series = pd.Series([1, 2, 3, 4, 5], 
                      index = ["a", "b", "c", "D", "E"])
print(pd_series)
print(type(pd_series))
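Once a Series has labels, values can be looked up by label as well as by position. This is a small sketch that assumes pandas is installed; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

pd_series = pd.Series([1, 2, 3, 4, 5],
                      index=["a", "b", "c", "D", "E"])

# Look up a value by its label
by_label = pd_series["c"]
print(by_label)

# Labels are case-sensitive: "D" exists but "d" does not
print("D" in pd_series.index, "d" in pd_series.index)

# .iloc looks up by position instead of by label
by_position = pd_series.iloc[0]
print(by_position)
```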

## Reminder: Pandas Dataframe and zip()

A pandas DataFrame is an ordered group of equal-length series/arrays. Each column holds data of a single type, but different columns can hold different types, so a single row can mix text and numbers (think of an MS Excel spreadsheet!)

> Pandas was a large part of our course in Week 2. This is just a little refresher!

In [None]:
# Create a DataFrame using lists

# Define three lists
heroes = ["Superman", "Black Panther", "Wonder Woman", "Batman", "Storm"]
hometown = ["Krypton", "Wakanda", "Themyscira", "Gotham", "Harlem"]
age = [23, 25, 41, 52, 22]

# Create an empty DataFrame
comics = pd.DataFrame()
print(comics)

In [None]:
# Use the lists to define the columns
comics["Name"] = heroes
comics["Home"] = hometown
comics["Age"] = age

# View the data frame
comics
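The idea that a DataFrame is a group of equal-length Series can be checked directly. The sketch below rebuilds a two-row version of the table from a dictionary, which is an alternative to adding columns one at a time; it assumes pandas is installed, and `age_column` is an illustrative name.

```python
import pandas as pd

# Dictionary keys become column names; each list becomes one column
comics = pd.DataFrame({
    "Name": ["Superman", "Storm"],
    "Home": ["Krypton", "Harlem"],
    "Age": [23, 22],
})

# Selecting one column returns a Series
age_column = comics["Age"]
print(type(age_column))

# Each column has a single dtype, and columns can differ from each other
print(comics.dtypes)

# shape is (number of rows, number of columns)
print(comics.shape)
```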

### `zip()`

The `zip()` function is a handy way to define data frames using tuples. 

In [None]:
zip?

In [None]:
# Using zip()

# Define lists to be the columns in our data frame
day = ["Monday", "Tuesday", "Friday"]
temp = [88, 91, 67]

# Convert our two separate lists into a list of tuples
tuples = list(zip(day, temp))
tuples

# Create the data frame 
pd.DataFrame(tuples, columns=["Day", "Temp"])
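It can help to look at what `zip()` produces on its own, before pandas is involved. The sketch below uses only built-in Python; the names `pairs`, `short`, `days_again`, and `temps_again` are illustrative.

```python
day = ["Monday", "Tuesday", "Friday"]
temp = [88, 91, 67]

# zip() pairs up the items found at the same position in each list
pairs = list(zip(day, temp))
print(pairs)

# zip() stops at the end of the shortest input
short = list(zip(day, [88, 91]))
print(short)

# zip(*pairs) reverses the pairing and gives the columns back as tuples
days_again, temps_again = zip(*pairs)
print(days_again)
print(temps_again)
```

Each tuple in `pairs` corresponds to one row of the data frame, which is why a list of tuples plus a list of column names is enough to build it.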

Create a small data frame like the ones above from scratch. Be creative! 

In [17]:
Name = ["Student A", "Student B", "Student C", "Student D"]
Reading = [88, 99, 92, 78]
Math = [100, 95, 77, 99]
Science = [78, 88, 85, 80]
Art = [68, 88, 90, 100]

# Each list is one column, so zip() them to get one row per student
students = pd.DataFrame(list(zip(Reading, Math, Science, Art)),
                        index = Name,
                        columns = ["Reading", "Math", "Science", "Art"])
students



           Reading  Math  Science  Art
Student A       88   100       78   68
Student B       99    95       88   88
Student C       92    77       85   90
Student D       78    99       80  100


In `pd.DataFrame`, what do the `index` and `columns` arguments do?

## Challenge 5: XML

.csv files are nice when the data have a known tabular structure (rows and columns) that we can quickly perform operations on. However, this flat format makes it difficult to represent nested, hierarchical data.

[Extensible Markup Language (.xml)](https://www.sitepoint.com/really-good-introduction-xml/) is good for representing data that have a hierarchical structure. Large, web-based datasets might be stored this way, and we can use their human readable format to select tags that we want to extract for analysis. However, these data are larger than .csv files because of the opening and closing "tags". 

There are a few libraries in Python that will parse XML for you, but we'll stick to the standard `xml` library in this course. Let's start out with a simple example. Suppose we want to store the metadata for the books in our library in an XML file. It might look something like this:

In [18]:
%pwd

'/Users/mervetekgurler/Desktop/PhD/UCB Digital Humanities Summer 2025/DIGHUM101-2025/Week 3'

In [20]:
import os
os.chdir("../Data")

In [21]:
os.getcwd()

'/Users/mervetekgurler/Desktop/PhD/UCB Digital Humanities Summer 2025/DIGHUM101-2025/Data'

In [22]:
%ls

2014wesp_country_classification.pdf  example.xml
Geo/                                 feminism.json
Obama_tweets.csv                     feminism.xml
USA/                                 frankenstein.txt
USA_adm.zip                          gapminder-FiveYearData.csv
baldwin_search.csv                   gapminder.csv
billnye tweets.csv                   gapminder.tsv
blm reddit.csv                       gapminder_gni.csv
books_search.csv                     human-rights/
childrens_lit.csv                    human_rights.csv
chis_data/                           iris.csv
citations.csv                        montagu/
compound_figure.pdf                  music_reviews.csv
correspondence-data-1585.csv         puppies_search.csv
covid_search.csv                     r_conspiracy.csv
dracula.txt                          us_racism_search.csv
example.json


In [23]:
!cat example.xml

<my-library>
    <book>
        <title>The Lion, the Witch and the Wardrobe</title>
        <author>C. S. Lewis</author>
        <date>1950</date>
        <publisher>Geoffrey Bles</publisher>
    </book>
    <book>
        <title>The Hobbit</title>
        <author>J. R. R. Tolkien</author>
        <date>1937</date>
        <publisher>George Allen and Unwin</publisher>
    </book>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
        <date>1960</date>
        <publisher>J. B. Lippincott and Co.</publisher>
    </book>
</my-library>

To parse this in Python, we'll need to import the `xml` library and then build the tree. From the `tree` object, we can get the root of the tree, and then traverse the tree like we would our filesystem paths.

In [None]:
import xml.etree.ElementTree as ET # loads functions from `xml` library under the name `ET`

tree = ET.parse("example.xml")
root = tree.getroot()
print(ET.tostring(root))

This is how it would look if we typed it out:

In [None]:
xml_string = '''
<my-library>
    <book>
        <title>The Lion, the Witch and the Wardrobe</title>
        <author>C. S. Lewis</author>
        <date>1950</date>
        <publisher>Geoffrey Bles</publisher>
    </book>
    <book>
        <title>The Hobbit</title>
        <author>J. R. R. Tolkien</author>
        <date>1937</date>
        <publisher>George Allen and Unwin</publisher>
    </book>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
        <date>1960</date>
        <publisher>J. B. Lippincott and Co.</publisher>
    </book>
</my-library>
'''

In [None]:
root = ET.fromstring(xml_string)
print(ET.tostring(root))

We can get the direct children of the root by converting it to a list with `list(root)`. (Older tutorials use the `getchildren()` method, which was removed in Python 3.9.)

In [None]:
list(root)

Calling `list()` on an element always yields the elements directly subordinate to it in the hierarchy. If we look at the XML string above, we can get the children of a `book` element as well:

In [None]:
first_book = list(root)[0]
list(first_book)

We can also use the `find` method to quickly locate an element. We'll use a for-loop to print the author of each book: we call `find` on each child of the root and read the result's `text` property:

In [None]:

for book in root:
    print(book.find('author').text)

Remember that XML elements are written as tags between `<` and `>` signs, and you can reach them with the `find` method or by iterating over their parent element. To get the actual information, or text, from an element, use the `.text` property.
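As a recap, the sketch below walks a shortened copy of the library XML using only the standard library. The names `child_tags` and `titles` are illustrative, and `findall` is shown as a companion to `find` for cases where more than one element matches.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the library XML used above
xml_string = """
<my-library>
    <book>
        <title>The Hobbit</title>
        <author>J. R. R. Tolkien</author>
    </book>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
    </book>
</my-library>
"""

root = ET.fromstring(xml_string)

# .tag is the element name; iterating over an element yields its children
print(root.tag)
child_tags = [child.tag for child in root]
print(child_tags)

# findall() returns every matching child, find() only the first one
titles = [book.find("title").text for book in root.findall("book")]
print(titles)
```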

## Challenge 6: XML Wikipedia data

Let's work with a real-world dataset. We're going to work with the revision history of a Wikipedia page. You can get this through Wikipedia's [API](https://www.mediawiki.org/wiki/API:Revisions) or [download](https://en.wikipedia.org/wiki/Wikipedia:Database_download) data directly. We'll look at the API more in a later module, so we've downloaded a page of revisions and saved it to the XML file in `../data/WIKIPEDIA/feminism.xml`

In [None]:
!head feminism.xml

We can start off by parsing it like we did above with our books:

In [None]:
import xml.etree.ElementTree as ET
tree = ET.parse("feminism.xml")
root = tree.getroot()

At this point, you can either look at the XML in a text editor, or start the process of looking through the tree manually to find the actual metadata of each revision. You'll have to do this when we work with JSON in a later challenge, so don't be afraid if none of this makes sense at this point!

In [None]:
list(root)

In [None]:
list(root.find('query'))

In [None]:
list(root.find('query/pages'))

In [None]:
list(root.find('query/pages/page'))

In [None]:
list(root.find('query/pages/page/revisions'))

Found it! Those are the revision items we want. Let's assign their parent element to a variable:

In [None]:
revisions = root.find('query/pages/page/revisions')

Now we can loop through and ask for some of the specific metadata about each revision, the `timestamp` for example:

In [None]:
for rev in revisions:
    print(rev.get('timestamp'))
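Note that the revision metadata is stored in XML attributes rather than in element text, which is why the loop uses `.get()` instead of `.text`. The sketch below shows the difference on a tiny, hypothetical stand-in for the API response: the `<api>` root tag, the user names, and the second timestamp are invented for illustration and are not taken from `feminism.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the API response; the real file is much larger
sample = """
<api>
  <query>
    <pages>
      <page>
        <revisions>
          <rev user="ExampleUserA" timestamp="2017-07-02T21:35:12Z" />
          <rev user="ExampleUserB" timestamp="2017-07-01T10:00:00Z" />
        </revisions>
      </page>
    </pages>
  </query>
</api>
"""

root = ET.fromstring(sample)
revisions = root.find("query/pages/page/revisions")

# Metadata stored as attributes is read with .get(), not .text
timestamps = [rev.get("timestamp") for rev in revisions]
print(timestamps)

# .attrib exposes all attributes of an element as a dictionary
first_attributes = revisions[0].attrib
print(first_attributes)

# .get() returns None when the attribute is not present
missing = revisions[0].get("size")
print(missing)
```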

## Challenge 7: JSON

As a Pythonista, after dealing with XML you'll be happy to see [JSON](https://en.wikipedia.org/wiki/JSON). JavaScript Object Notation (JSON) is often preferred because it looks almost exactly like a Python `dictionary`. The `json` library takes care of the few differences between the two so that we don't have to ourselves. Let's take a look at the same data about the books in our library from the XML challenge, this time in JSON format:

In [None]:
!cat example.json

Looks much nicer than XML and its many tags! In Python, it looks like a `dictionary` with a key of `my_library` and a value of a `list` of `dictionary` objects. Each of these `dictionary` objects in the list contains the metadata. We can use the `json` library to make it exactly that!

In [24]:
import json

# Open the file in a `with` block so it is closed automatically
with open("example.json") as f:
    my_library = json.load(f)
my_library

{'my_library': [{'title': 'The Lion, the Witch and the Wardrobe',
   'author': 'C. S. Lewis',
   'date': '1950',
   'publisher': 'Geoffrey Bles'},
  {'title': 'The Hobbit',
   'author': 'J. R. R. Tolkien',
   'date': '1937',
   'publisher': 'George Allen and Unwin'},
  {'title': 'To Kill A Mockingbird',
   'author': 'Harper Lee',
   'date': '1960',
   'publisher': 'J. B. Lippincott and Co.'}]}

In [25]:
print(type(my_library))

<class 'dict'>
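The "few differences" between JSON and Python mentioned earlier are mostly in how literals are spelled. Here is a minimal sketch using a hypothetical record (the keys and values are invented for illustration) and `json.loads`, which parses a string rather than a file.

```python
import json

# Hypothetical JSON text showing the literals that are spelled differently
text = '{"title": "Example Book", "available": true, "subtitle": null, "copies": 2}'

# json.loads() turns JSON text into a Python dictionary
record = json.loads(text)
print(record)

# JSON true/false/null become Python True/False/None
print(record["available"], record["subtitle"])

# json.dumps() goes the other way: Python object to JSON text
round_trip = json.dumps(record)
print(round_trip)
```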


Index `my_library` to return only the year "The Hobbit" was published.

In [26]:
my_library["my_library"][1]["date"]



'1937'

You will visualize some XML and JSON data next week!

## Challenge 8: JSON Wikipedia Data

Let's go back to that revision history data. Wikipedia actually prefers that you get the data in JSON format, so we've downloaded some more data for you in JSON. It's located at `../data/WIKIPEDIA/feminism.json`:

In [27]:
# Open the file in a `with` block so it is closed automatically
with open("feminism.json") as f:
    feminism_json = json.load(f)
feminism_json[0]

{'user': 'N0n3up',
 'timestamp': '2017-07-02T21:35:12Z',
 'size': 138368,
 'comment': 'Undid revision 788682640 by [[Special:Contributions/N0n3up|N0n3up]] ([[User talk:N0n3up|talk]]) sigh',
 'parsedcomment': 'Undid revision 788682640 by <a href="/wiki/Special:Contributions/N0n3up" title="Special:Contributions/N0n3up">N0n3up</a> (<a href="/wiki/User_talk:N0n3up" title="User talk:N0n3up">talk</a>) sigh',
 'tags': []}

In [28]:
feminism_json[0].keys()

dict_keys(['user', 'timestamp', 'size', 'comment', 'parsedcomment', 'tags'])

In [29]:
feminism_json[0]['timestamp']

'2017-07-02T21:35:12Z'
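Since the revision data is a list of dictionaries, the same field can be pulled out of every record with a loop or list comprehension. The sketch below uses a hypothetical two-record sample with some of the same keys; the user names, the second timestamp, and the second size are invented for illustration.

```python
from datetime import datetime

# Hypothetical sample shaped like the revision records above
revisions = [
    {"user": "ExampleUserA", "timestamp": "2017-07-02T21:35:12Z", "size": 138368},
    {"user": "ExampleUserB", "timestamp": "2017-07-01T10:00:00Z", "size": 138300},
]

# Collect one field from every record with a list comprehension
timestamps = [rev["timestamp"] for rev in revisions]
print(timestamps)

# Parse the timestamp strings into datetime objects so they can be compared
parsed = [datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") for ts in timestamps]
newest = max(parsed)
print(newest.year, newest.month, newest.day)

# Find the record with the largest page size
largest = max(revisions, key=lambda rev: rev["size"])
print(largest["user"])
```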