# Data and Data Collection


Author: Justin Chun-ting Ho

Date: 27 Nov 2023

Credit: Some sections are adapted from the slides prepared by Damian Trilling

## Tentative Plans for today:



| Time          | Topic | Location |
|-------------------|-----------|-------------|
| 9:30 - 11:00      | Data and File Types  | V2.09        |
| 11:15 - 12:30 | API      | V2.09       |
| 13:30 - 15:00 | Web Scraping | V1.13        |
| 15:00 - 16:00 | Exercise | V1.13        |


## Three popular ways to get data

- Datasets
- API
- Webscraping

## Datasets

- Public datasets
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- Good people who shared with you
- For many topics, they don't exist!

## API

- Many platforms have an API
- But many of the APIs are not open
- APIs are not designed for researchers (!)

## Webscraping

- Theoretically all webpages can be scraped
- In many cases, it is much harder than it seems
- In other cases, web designers actively try to stop you from scraping

## Some Considerations

- Some methods need more maintenance (you need to keep updating your teaching materials and knowledge)
- Success rate (some methods have a higher chance of failing)
- The "Cost Revenue Ratio" (is the data you get worth the effort to learn/maintain?)
- Availability in the future (for example, [this](https://github.com/cjbarrie/academictwitteR))

## Data structures and files

| Use Case          | Data Type | File Format |
|-------------------|-----------|-------------|
| texts             | string    | .txt        |
| hierarchical data | dict      | .json       |
| table             | dataframe | .csv        |

## String -> .txt

In [9]:
data = """
In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the
ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down
on or to eat: it was a hobbit hole, and that means comfort.
"""
with open("data.txt", mode="w") as file:
    file.write(data)

Attention: It will overwrite without warning.

In [12]:
data = """
When Mr Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday 
with a party of special magnificence, there was much talk and excitement in Hobbiton.
"""
with open("data.txt", mode="w") as file:
    file.write(data)

## .txt -> String

In [13]:
with open("data.txt", mode="r") as file:
    passage = file.read()
    
print(passage)


When Mr Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday 
with a party of special magnificence, there was much talk and excitement in Hobbiton.
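
The overwrite warning above can be worked around with other file modes. The following is a minimal sketch, assuming a throwaway file called `notes.txt` (a hypothetical name) in a temporary directory: `mode="x"` refuses to replace an existing file, and `mode="a"` appends to the end.

```python
import os
import tempfile

# Work in a throwaway directory so no file from the examples above is touched.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "notes.txt")  # hypothetical file name

# mode="x" creates the file, but raises an error if it already exists.
with open(path, mode="x", encoding="utf-8") as file:
    file.write("first line\n")

try:
    with open(path, mode="x", encoding="utf-8") as file:
        file.write("this would have replaced the first line\n")
except FileExistsError:
    print("File already exists, nothing was overwritten.")

# mode="a" adds to the end of the file instead of replacing its content.
with open(path, mode="a", encoding="utf-8") as file:
    file.write("second line\n")

with open(path, mode="r", encoding="utf-8") as file:
    content = file.read()

print(content)
```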



## list -> file

In [14]:
data = ["Gandalf", "Frodo", "Sam", "Aragorn", "Legolas", "Gimli", "Peregrin", "Meriadoc", "Boromir"]

In [17]:
with open("fellowship.txt", mode="w") as f:
    for datum in data:
        f.write(datum + "\n")

In [20]:
with open("fellowship.txt", mode="r") as f:
    fellowship = [line for line in f]
print(fellowship)

['Gandalf\n', 'Frodo\n', 'Sam\n', 'Aragorn\n', 'Legolas\n', 'Gimli\n', 'Peregrin\n', 'Meriadoc\n', 'Boromir\n']
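
Note that every name read back above still carries its trailing `\n`. A small sketch of one way to remove it, assuming a temporary file and a shortened list of names chosen only for illustration:

```python
import os
import tempfile

names = ["Gandalf", "Frodo", "Sam"]  # shortened list for illustration
path = os.path.join(tempfile.mkdtemp(), "names.txt")  # hypothetical file name

with open(path, mode="w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

with open(path, mode="r", encoding="utf-8") as f:
    # rstrip("\n") removes the line ending but keeps everything else
    cleaned = [line.rstrip("\n") for line in f]

print(cleaned)
```

After stripping, the list read from the file equals the list that was written.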


## dict -> file

In [25]:
import json

data = {'Gandalf': {'office': '020222', 'mobile': '0666666'},
    'Frodo': {'office': '030111'},
    'Sam': {'office': '040444', 'mobile': '0644444'},
    'Aragorn': "020222222",
    'Legolas': ["010111", "06222"]}

with open("phonebook.json", mode="w") as f:
    json.dump(data, f)

In [28]:
import json

with open("phonebook.json", mode="r") as f:
    data = json.load(f)

print(data)

{'Gandalf': {'office': '020222', 'mobile': '0666666'}, 'Frodo': {'office': '030111'}, 'Sam': {'office': '040444', 'mobile': '0644444'}, 'Aragorn': '020222222', 'Legolas': ['010111', '06222']}
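
By default the JSON output is written on a single line, with non-ASCII characters escaped. As a hedged sketch, the `indent` and `ensure_ascii` arguments of the `json` module make the result easier to read by eye. The entry for "Éowyn" is a hypothetical addition, included only to show a non-ASCII character.

```python
import json

phonebook = {
    "Gandalf": {"office": "020222", "mobile": "0666666"},
    "Éowyn": {"office": "050555"},  # hypothetical entry with a non-ASCII name
}

compact = json.dumps(phonebook)  # one line, non-ASCII escaped
pretty = json.dumps(phonebook, indent=2, ensure_ascii=False)  # indented, readable

print(compact)
print(pretty)

# Both strings describe exactly the same data.
restored = json.loads(pretty)
```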


## object -> pickle

In [30]:
import pickle

with open('phonebook.pkl', 'wb') as f:
    pickle.dump(data, f)

In [31]:
with open('phonebook.pkl', 'rb') as f:
    phonebook = pickle.load(f) # deserialize using load()

print(phonebook) # print the restored phonebook

{'Gandalf': {'office': '020222', 'mobile': '0666666'}, 'Frodo': {'office': '030111'}, 'Sam': {'office': '040444', 'mobile': '0644444'}, 'Aragorn': '020222222', 'Legolas': ['010111', '06222']}
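
One reason to reach for pickle is that it can store Python objects that JSON cannot represent. A minimal sketch, assuming a made-up record containing a set and a tuple:

```python
import json
import pickle

# Hypothetical data: sets and tuples have no direct JSON equivalent.
record = {"members": {"Frodo", "Sam"}, "position": (3, 4)}

try:
    json.dumps(record)
except TypeError as error:
    print("json cannot store this:", error)

# pickle round-trips the object with its original types intact.
restored = pickle.loads(pickle.dumps(record))
print(restored == record)
```

Keep in mind that pickle files are Python-specific and should only be loaded when they come from a source you trust.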


Note: Some .json files contain one JSON object per line instead of one object per file (a layout often called JSON Lines). You need a for loop to parse each line and combine the results.
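
The loop mentioned in the note could look roughly like this. The sketch uses `io.StringIO` as a stand-in for a real file, and the two records are invented for illustration:

```python
import io
import json

# Stand-in for a file with one JSON object per line (hypothetical content).
jsonl_file = io.StringIO(
    '{"name": "Gandalf", "office": "020222"}\n'
    '{"name": "Frodo", "office": "030111"}\n'
)

records = []
for line in jsonl_file:
    line = line.strip()
    if line:  # skip empty lines
        records.append(json.loads(line))

print(records)
```

With a real file, replace `jsonl_file` with the object returned by `open(...)`.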

## tabular data -> .csv

In [36]:
# How can we store the data?
names = ['Gandalf', 'Frodo', 'Sam', 'Aragorn', 'Legolas']
phonenumbers = ['020111111', '020222222', '020333333', '020444444', '020555555']

1. We can convert to dict and store as json (it works)

2. We can also store in a table (.csv)

In [40]:
import csv

with open("phonebook.csv", mode="w", newline="") as f:
    mywriter = csv.writer(f)
    for row in zip(names,phonenumbers):
        mywriter.writerow(row)
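
For completeness, here is a sketch of the round trip with the `csv` module: writing a header row first and reading the file back with `csv.DictReader`. The column names `name` and `phone` and the temporary file path are assumptions made for this example.

```python
import csv
import os
import tempfile

names = ["Gandalf", "Frodo"]
phonenumbers = ["020111111", "020222222"]
path = os.path.join(tempfile.mkdtemp(), "phonebook.csv")  # hypothetical location

# newline="" lets the csv module control the line endings itself.
with open(path, mode="w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "phone"])          # header row (assumed column names)
    writer.writerows(zip(names, phonenumbers))  # one row per person

with open(path, mode="r", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))  # each row becomes a dict keyed by the header

print(rows)
```

Everything read back is a string, so the leading zeros of the phone numbers survive.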

Note: you don't have to do it like this! In most cases, pandas is a more efficient way to do it. But there is one more thing about text data...

## Encodings

### How to separate data?

- new line = new record?
- Unix-style (\n, also known as LF) or Windows-style (\r\n, also known as CRLF) line endings?
- what delimiter? Most files use a comma (,), but some use a tab, or even whitespace!
- or new file = new record?
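
The questions above matter in practice because the reader has to be told which delimiter to expect. A small sketch, using in-memory text instead of real files (the sample content is made up):

```python
import csv
import io

# Same records, once comma-separated with LF, once tab-separated with CRLF.
comma_text = "Gandalf,020111111\nFrodo,020222222\n"
tab_text = "Gandalf\t020111111\r\nFrodo\t020222222\r\n"

comma_rows = list(csv.reader(io.StringIO(comma_text, newline="")))
tab_rows = list(csv.reader(io.StringIO(tab_text, newline=""), delimiter="\t"))

# Reading the tab-separated text with the default delimiter (comma)
# does not fail, but every line ends up in a single column.
wrong_rows = list(csv.reader(io.StringIO(tab_text, newline="")))

print(comma_rows)
print(tab_rows)
print(wrong_rows)
```

The silent failure in `wrong_rows` is the reason to check the first few rows after loading any delimited file.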

### How to convert characters to bytes (and back again)?

- choose the right encoding (it is almost always UTF-8)
- note that some applications use other encodings by default (eg Excel)
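
To make the conversion between characters and bytes concrete, here is a short sketch. The word "Bär" is just an example string; the same text produces different bytes depending on the encoding.

```python
text = "Bär"  # example string with one non-ASCII character

utf8_bytes = text.encode("utf-8")      # ä becomes two bytes
latin1_bytes = text.encode("latin-1")  # ä becomes one byte

print(utf8_bytes)
print(latin1_bytes)

# Decoding with the same encoding that was used for encoding gives the text back.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip)
```

When opening files, the encoding can be stated explicitly, for example `open("data.txt", mode="r", encoding="utf-8")`.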

### comma-separated values (csv)

- Almost every program can read it
- Human-readable in a simple text editor
- Plain text, with a comma (or a semicolon) denoting column breaks
- No limits regarding the size (in most cases)
- Note: several dialects (eg , ; \t), and extensions (eg .csv, .tab, .txt)

In [46]:
# Let's go back to the csv we just created

with open("phonebook.csv", "r") as f:
    output_text = f.read()

output_text

'Gandalf,020111111\nFrodo,020222222\nSam,020333333\nAragorn,020444444\nLegolas,020555555\n'

### Which is the worst editor, and why is it Excel?

- Sometimes legacy encodings (ASCII, ANSI, Windows-1252 etc) are still used
- They don't support all Unicode symbols (eg emojis, accented characters, non-latin scripts)
- What is an ä in one encoding may be an ø in another
- Some programs use legacy encodings without telling you!
- Use UTF-8 from beginning to end, unless you like chaos
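
The chaos is easy to reproduce. The sketch below shows three typical symptoms, using Windows-1252 (`cp1252`) and the older DOS code page `cp850` as example legacy encodings; the specific characters are chosen only for illustration.

```python
# 1. Text written as UTF-8 but read as Windows-1252 turns into gibberish.
original = "ä"
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# 2. The same byte means different characters in different legacy encodings.
byte = b"\x84"
as_cp850 = byte.decode("cp850")
as_cp1252 = byte.decode("cp1252")
print(as_cp850, as_cp1252)

# 3. Legacy encodings cannot represent every Unicode character.
try:
    "😀".encode("cp1252")
    emoji_supported = True
except UnicodeEncodeError:
    emoji_supported = False
print(emoji_supported)
```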

## Dataframes

### What are dataframes?

- pd.DataFrames (from the pandas package)
- objects that store tabular data in rows and columns.
- columns and rows can have names
- they have methods built-in for data wrangling and analysis

### Creating dataframes

In [53]:
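# Option 1:
# df = pd.DataFrame(list-of-lists, dict, or similar), use "pd.DataFrame?" for help
import pandas as pd

# Note: the second positional argument is the index, so the phone numbers become row labels
df = pd.DataFrame(names, phonenumbers)
df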

|           | 0       |
|-----------|---------|
| 020111111 | Gandalf |
| 020222222 | Frodo   |
| 020333333 | Sam     |
| 020444444 | Aragorn |
| 020555555 | Legolas |
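
If the goal is a table with one column per list, passing a dict is one common alternative. This is a sketch, and the column names `name` and `phone` are assumptions made for illustration:

```python
import pandas as pd

names = ['Gandalf', 'Frodo', 'Sam', 'Aragorn', 'Legolas']
phonenumbers = ['020111111', '020222222', '020333333', '020444444', '020555555']

# Each dict key becomes a column name, each list becomes a column.
df = pd.DataFrame({"name": names, "phone": phonenumbers})
print(df)
```

Here the rows get a default integer index (0 to 4) and both lists appear as named columns.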


In [60]:
# Option 2:
# read from file

df = pd.read_csv("phonebook.csv", header=None)
df

|   | 0       | 1        |
|---|---------|----------|
| 0 | Gandalf | 20111111 |
| 1 | Frodo   | 20222222 |
| 2 | Sam     | 20333333 |
| 3 | Aragorn | 20444444 |
| 4 | Legolas | 20555555 |
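
Notice that the leading zeros of the phone numbers are gone: pandas guessed that the second column contains integers. A hedged sketch of one way to avoid this, reading from an in-memory string instead of the file and using assumed column names:

```python
import io
import pandas as pd

csv_text = "Gandalf,020111111\nFrodo,020222222\n"  # same layout as phonebook.csv

# Default behaviour: the second column is parsed as a number.
df_default = pd.read_csv(io.StringIO(csv_text), header=None)

# dtype=str keeps every value as text; names= supplies column names.
df_text = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["name", "phone"],  # assumed column names
    dtype=str,
)

print(df_default)
print(df_text)
```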


### When to use dataframes?

- tabular data
- visual inspection
- data wrangling or statistical analysis

### When not to use dataframes?

- non-tabular data
- when it does not make sense to consider rows as cases and columns as variables
- if you only care about one (or maybe two) column anyway
- size of dataset > available RAM
- long or expensive operations: play it safe and write to / read from a file line by line
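
Processing a file line by line could look roughly like the sketch below. It first creates a small sample file (the file names and the `user,score` layout are invented for this example), then filters it without ever holding the whole file in memory.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "big.csv")       # hypothetical input file
target_path = os.path.join(workdir, "filtered.csv")  # hypothetical output file

# Create a sample file with 1000 rows of the form "user<i>,<score>".
with open(source_path, mode="w", encoding="utf-8") as f:
    for i in range(1000):
        f.write(f"user{i},{i % 7}\n")

# Read one line at a time and write matching lines straight to the output.
kept = 0
with open(source_path, mode="r", encoding="utf-8") as source, \
     open(target_path, mode="w", encoding="utf-8") as target:
    for line in source:
        user, score = line.rstrip("\n").split(",")
        if score == "0":
            target.write(line)
            kept += 1

print(kept)
```

Because results are written as soon as they are available, an interruption halfway through does not lose the work already done.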

## Discussion

### What is your preferred way to store the following data?

- Data about YouTube videos
- News articles scraped from an online outlet
- Comments in an online forum
- Results from an online survey
- Any other data from your work?