In [2]:
# import libraries
import pandas as pd
import nltk
import spacy

# change this to your own data directory
data_dir = "data/"

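# read and preprocess data
text_file_name = "osdg-community-data-v2023-01-01.csv"
text_df = pd.read_csv(data_dir + text_file_name, sep="\t", quotechar='"')
# the file is read as a single column, so recover the real column names from its header
col_names = text_df.columns.values[0].split('\t')
# split each row on tabs and spread the pieces across the recovered columns
text_df[col_names] = text_df[text_df.columns.values[0]].apply(lambda x: pd.Series(str(x).split("\t")))
# convert the numeric columns from strings to numeric types
text_df = text_df.astype({'sdg': int, 'labels_negative': int, 'labels_positive': int, 'agreement': float}, copy=True)
# drop the original combined column
text_df.drop(text_df.columns.values[0], axis=1, inplace=True)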

# Solutions to Exercises

## Preprocessing

**Exercise 1.1**

Answers may vary.

**Exercise 1.2**

Answers may vary.

**Exercise 1.3**

The following code removes any rows that contain only N/A values. In this case, there are no such rows to remove.

In [3]:
nrows_old = text_df.shape[0]
text_df.dropna(axis=0, how='all', inplace=True)
print("Number of rows removed:", nrows_old - text_df.shape[0])

Number of rows removed: 0


The next line of code checks each column for any remaining N/A values. It turns out that there are none.

In [4]:
text_df.isna().any()

doi                False
text_id            False
text               False
sdg                False
labels_negative    False
labels_positive    False
agreement          False
dtype: bool

Whether or not entries with N/A values should be removed depends on the dataset and the nature of the problem. Sometimes, entries with N/A values should be dropped, while at other times, they should be kept unchanged, or replaced with interpolated or placeholder values. Consult [the `pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) for more information about how to deal with missing values in dataframes.
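The three options mentioned above (dropping, replacing with placeholders, and interpolating) can be sketched as follows. This is a minimal illustration on a made-up dataframe, not on the OSDG data; the name `toy_df` and its values are hypothetical, and the right choice still depends on the problem at hand.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical, not taken from the OSDG dataset)
toy_df = pd.DataFrame({
    "text": ["first", None, "third", "fourth"],
    "agreement": [0.2, np.nan, 0.6, 1.0],
})

# option 1: drop every row that has at least one missing value
dropped = toy_df.dropna(axis=0, how="any")

# option 2: replace missing values with placeholders, chosen per column
filled = toy_df.fillna({"text": "", "agreement": 0.0})

# option 3: linearly interpolate a numeric column from its neighbours
interpolated = toy_df["agreement"].interpolate(method="linear")

print("Rows left after dropping:", dropped.shape[0])
print("Any N/A left after filling:", filled.isna().any().any())
print("Interpolated agreement:", interpolated.tolist())
```

Note that `how="any"` is stricter than the `how="all"` used in Exercise 1.3: it removes a row as soon as a single value is missing.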

**Exercise 1.4**

After filtering the dataset, we inspect it using the `info()` method.

In [5]:
# filter the dataset
text_df = text_df.query("agreement > 0.5 and (labels_positive - labels_negative) > 2")
text_df.reset_index(inplace=True, drop=True)

# inspect it
text_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24669 entries, 0 to 24668
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   doi              24669 non-null  object 
 1   text_id          24669 non-null  object 
 2   text             24669 non-null  object 
 3   sdg              24669 non-null  int64  
 4   labels_negative  24669 non-null  int64  
 5   labels_positive  24669 non-null  int64  
 6   agreement        24669 non-null  float64
dtypes: float64(1), int64(3), object(3)
memory usage: 1.3+ MB


After filtering, we have 24669 entries with 7 features (see [section 0](sec0_data.ipynb) for details). The data types range from `object` (likely denoting strings) to `int64` (integers) to `float64` (floating-point numbers). This is a reasonable amount of data to work with.

**Exercise 1.5**

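The Porter and Snowball stemmers produce largely comparable output, while the Lancaster stemmer is the most aggressive. As a result, the Lancaster stemmer is the most likely to over-stem (that is, to cut words down so far that unrelated words share a stem), and it is likely to have the most trouble on a larger set of tokens.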
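One way to make this comparison concrete is to run all three stemmers over the same words and line up the results. The sketch below assumes the stemmer classes from `nltk.stem`; the word list is an arbitrary, hypothetical choice, and other words may show larger or smaller differences.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# the three stemmers being compared
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# arbitrary sample words (hypothetical; swap in tokens from your own data)
words = ["running", "organization", "children", "happily"]

# stem every word with every stemmer
stems = {name: [stemmer.stem(word) for word in words]
         for name, stemmer in stemmers.items()}

# print one row per stemmer so the outputs can be compared side by side
for name, result in stems.items():
    print(f"{name:>9}: {result}")
```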

**Exercise 1.6**

Answers may vary. Some possible observations include the fact that stemmers tend to remove affixes (such as `-ing`, `-ed`, and `-s` in English) and the fact that irregular words are particularly likely to give the stemmers trouble.

**Exercise 1.7**

Answers may vary.

**Exercise 1.8**

Answers may vary. Some possible entity labels include `NORP` ("nationalities or religious or political groups"), `TIME` ("times smaller than a day"), `QUANTITY` ("measurements, as of weight or distance"), and `WORK_OF_ART` ("titles of books, songs, etc.").
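Label descriptions like these can be looked up with `spacy.explain`, which reads from spaCy's built-in glossary and does not require a trained pipeline to be loaded. The list of labels below is only an illustrative selection.

```python
import spacy

# an illustrative selection of entity labels
labels = ["NORP", "GPE", "TIME", "QUANTITY", "WORK_OF_ART"]

# look up the glossary description of each label
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label:<12}{description}")
```

Comparing `NORP` and `GPE` in the output is a useful check, since the two labels are easy to mix up.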

**Exercise 1.9**

Sample code solution:

In [37]:
# load trained pipeline
nlp = spacy.load('en_core_web_sm')

# perform NER on random sample in both original and lower case
sample = text_df['text'].sample(1).values[0]
doc = nlp(sample)
print('ORIGINAL CASE')
spacy.displacy.render(doc, style='ent', jupyter=True)
print('\nLOWERCASE')
doc = nlp(sample.lower())
spacy.displacy.render(doc, style='ent', jupyter=True)

ORIGINAL CASE
[displaCy entity rendering not shown]

LOWERCASE
[displaCy entity rendering not shown]

Answers may vary depending on the samples chosen. This sample demonstrates that the model sometimes confuses organizations with people. Additionally, it shows that the model often fails to recognize organization names (especially abbreviated ones) when they are converted to lowercase.

**Exercise 1.10**

Answers may vary.