## Exercises week 1

# Working with textual data

### 0. Get the data.

- Download `articles.tar.gz` or `articles.zip` from Canvas (under `Week 1`). Please note that this is not the full dataset, but a random sample of the data described [here](https://dx.doi.org/10.7910/DVN/ULHLCB).


<div class="alert-danger">
<p>Alternatively, you can also download <code>articles.tar.gz</code> from
<a href="https://dx.doi.org/10.7910/DVN/ULHLCB">https://dx.doi.org/10.7910/DVN/ULHLCB</a> to get <strong>all</strong> the data. Please note that this is a very large dataset, and for practice purposes, you do not need everything. </p>
</div>



- Unpack it. On Linux and macOS, you can do this with `tar -xzf articles.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` (note that, technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool).
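
If you prefer to stay inside Python, the unpacking step can also be sketched with the standard library. This is a minimal sketch, assuming the downloaded archive sits in your working directory; the function name `unpack_articles` and the default target folder are illustrative choices, not part of the course material.

```python
import shutil
from pathlib import Path


def unpack_articles(archive, target="."):
    """Unpack a .tar.gz or .zip archive into `target` and return that folder."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    # shutil chooses the unpacker based on the file extension
    shutil.unpack_archive(str(archive), str(target))
    return target


# Example (hypothetical location of the downloaded file):
# unpack_articles("articles.tar.gz")
```

Because `shutil.unpack_archive` handles both `.tar.gz` and `.zip`, the same call should work for either download.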


### 1. Inspect the structure of the dataset.
What information do the following elements give you?

- folder (directory) names
- folder structure/hierarchy
- file names
- file contents
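
One possible way to get a first overview of the folder hierarchy is to count the files in each folder. The sketch below assumes the data was unpacked into a folder called `articles`; the helper name `files_per_folder` is made up for this illustration.

```python
from collections import Counter
from pathlib import Path


def files_per_folder(root):
    """Count the files in every folder below `root`, keyed by relative folder path."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.parent.relative_to(root).as_posix()] += 1
    return counts


# Example, assuming the data was unpacked to 'articles/':
# for folder, n in sorted(files_per_folder("articles").items()):
#     print(folder, n)
```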

### 2. Discuss strategies for working with this dataset!

- Which questions could you answer?
- How could you deal with it, given the size and the structure?
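
As one input for the discussion: a generator lets you process files one at a time instead of holding the whole dataset in memory. The following is only a sketch under two assumptions that you should check against the data yourself: that the files are UTF-8 encoded text, and that the name of the parent folder identifies the news source. The name `iter_documents` is hypothetical.

```python
from glob import glob
from pathlib import Path


def iter_documents(pattern):
    """Yield (source, text) pairs lazily, one file at a time."""
    for filename in sorted(glob(pattern)):
        path = Path(filename)
        if not path.is_file():
            continue
        # assumption: UTF-8 text; undecodable bytes are replaced, not raised
        with open(path, encoding="utf-8", errors="replace") as f:
            # assumption: the parent folder name is the news source
            yield path.parent.name, f.read()


# Example: look at the first three matching files only
# from itertools import islice
# for source, text in islice(iter_documents("articles/*/Vox/*"), 3):
#     print(source, len(text))
```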

### 3. Read some (or all?) data

Here is some example code that you can modify. You could, for instance, do the following to read a *part* of your dataset.

In [None]:
from glob import glob
import random
from nltk.tokenize import TreebankWordTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
#specify the path to your unpacked articles.
PATH = 'articles/'

In [None]:
newspaperfiles = glob(PATH+'/*/Vox/*')
documents = []
for filename in newspaperfiles:
    with open(filename) as f:
        documents.append(f.read())

In [None]:
len(documents)

<div class="alert-info">
<ul>
<li>Can you explain what the <code>glob</code> function does?</li>
<li>Can you modify the code so that it reads in, e.g., <code>The Guardian</code> or another news source in the folder <code>articles</code>?</li>
<li>What does <code>documents</code> contain? First make an educated guess based on the code snippet, then check it! Do <em>not</em> print the whole thing, but use <code>len()</code>, <code>type()</code> and slicing (e.g., <code>[:10]</code>) to get the info you need.</li>
</ul>
</div>


<br>
<div class="alert-block alert-warning">
<p>Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!)</p><code>articles = random.sample(documents, 10)</code>
</div>
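
Note that `random.sample` raises a `ValueError` when you ask for more items than the list contains, and that it returns a different sample on every run. A minimal sketch that guards against both, assuming you want reproducible practice samples (the helper name `sample_documents` and the seed value are arbitrary choices):

```python
import random


def sample_documents(documents, k=10, seed=42):
    """Return up to `k` randomly chosen documents, reproducibly."""
    rng = random.Random(seed)      # local generator, leaves the global state alone
    k = min(k, len(documents))     # avoid ValueError when fewer than k documents exist
    return rng.sample(documents, k)


# Example:
# articles = sample_documents(documents, 10)
```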

### 4. Perform some analyses!

- Perform some first analyses on the data using string methods and regular expressions!

Techniques you can try out include:

a.  lowercasing

b.  tokenization

c.  stopword removal

d.  stemming and/or lemmatizing
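
The cells below focus on string methods and `nltk`. As a small illustration of the regular-expression route, here is a sketch that lowercases a text, extracts word-like tokens with `re.findall`, and counts them. The pattern and the helper name `word_frequencies` are just one possible choice, not the prescribed solution.

```python
import re
from collections import Counter


def word_frequencies(text, n=5):
    """Lowercase `text`, extract word-like tokens with a regex, return the top `n`."""
    # [a-z]+(?:'[a-z]+)? keeps contractions such as "didn't" together
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(n)


example = "The cat sat. The dog didn't sit; the cat did."
print(word_frequencies(example, 2))
# [('the', 3), ('cat', 2)]
```

Note that this pattern only matches the letters a to z, so accented characters and digits are dropped; adapt it to your data.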

In [None]:
articles =random.sample(documents, 10)

    #a. lowercasing articles

In [None]:
articles_lower_cased = []
for art in articles:
    articles_lower_cased.append(art.lower())

In [None]:
#same, using list comprehension:
articles_lower_cased = [art.lower() for art in articles]

    #b. tokenization

In [None]:
#b. Basic approach to tokenization
articles_split = []
for art in articles:
    articles_split.append(art.split())

# make sure to print often to check your progress:
# print(articles_split[0])

In [None]:
# same, using list comprehension:
articles_split = [art.split() for art in articles]

In [None]:
# More advanced approach to tokenization
tokenizer = TreebankWordTokenizer()

articles_tokenized = []
for art in articles:
    articles_tokenized.append(tokenizer.tokenize(art))

In [None]:
# Same, but using list comprehension:
articles_tokenized = [tokenizer.tokenize(art) for art in articles ]

    #c. stopword removal

In [None]:
# initialize a stopword list
mystopwords = stopwords.words("english")

#check what is in there:
print(mystopwords)

In [None]:
# manually add more stopwords to your list if needed:
mystopwords.extend(["add", "more", "words"]) 
print(mystopwords)

In [None]:
#Now, remove these stopwords from the corpus:

articles_without_stopwords = []
for article in articles:
    articles_no_stop = ""
    for word in article.lower().split():
        if word not in mystopwords:
            articles_no_stop = articles_no_stop + " " + word
    # strip the leading space so the result matches the list-comprehension version
    articles_without_stopwords.append(articles_no_stop.strip())

In [None]:
# same as the cell above, this time with list comprehension

articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles]
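
Two things are worth checking in the approach above: `str.split()` leaves punctuation attached to words, so a token like `and,` does not match the stopword `and`; and looking words up in a list gets slow for large corpora, whereas a `set` is fast. The following sketch illustrates both points with a tiny made-up stopword set (`demo_stopwords` is hypothetical; in the notebook you could pass `set(mystopwords)` instead).

```python
import string

# hypothetical mini stopword list, only for illustration
demo_stopwords = {"the", "a", "is", "and"}


def remove_stopwords(text, stopword_set):
    """Drop stopwords, ignoring punctuation attached to a token when comparing."""
    kept = []
    for token in text.lower().split():
        bare = token.strip(string.punctuation)
        if bare and bare not in stopword_set:
            kept.append(bare)
    return " ".join(kept)


print(remove_stopwords("The end is near, and the rest is noise.", demo_stopwords))
# end near rest noise
```

Whether you want to strip punctuation at this stage depends on your research question, so treat this as one option rather than the default.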


<br>
<div class="alert-block alert-warning">
It's good practice to frequently inspect the results of your code, to make sure you are not making mistakes and that the results make sense. For example, compare your results to some random articles from the original sample:
</div>

In [None]:
print(articles[8][:100])
print("-----------------")
print(articles_without_stopwords[8][:100])

    #d. stemming

In [None]:
stemmer = SnowballStemmer("english")

stemmed_text = []
for article in articles:
    stemmed_words = ""
    for word in article.lower().split():
        stemmed_words = stemmed_words + " " + stemmer.stem(word)
    stemmed_text.append(stemmed_words.strip())

Same as the cell above, this time with list comprehension:

In [None]:
stemmed_text  = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles]

Please note that alternative stemmers are available through the `nltk` library.
E.g., you can experiment with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
```


## OPTIONAL:

If you want to try out lemmatization, you need to download a language model from spaCy.

Please run the following command in your terminal:

`python3 -m spacy download en_core_web_sm`

In [None]:
import nltk
import spacy

# Initialize spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Lemmatize a single article from the random sample, articles[1],
# using spaCy's English language model
lemmatized_sentence = " ".join([token.lemma_ for token in nlp(articles[1])])

# Print the first 100 characters of the original and of the lemmatized article
print(articles[1][0:100])
#print("**********")
print(lemmatized_sentence[0:100])