# Dataframe from CSV on the web

by Koenraad De Smedt at UiB



---
The CSV format is a plain text representation of a table, where lines are rows and columns (or fields) are separated by a designated delimiter, often a *comma*. If the delimiter is a *tab* (tabulator character), the format is often called TSV.
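As a small illustration of the two delimiters, the sketch below reads the same made-up table once as CSV and once as TSV. The newspaper names and values are hypothetical and are not taken from the dataset used later; `io.StringIO` is only used so that the example works without a file or a network connection.

```python
import io
import pandas as pd

# Hypothetical miniature tables, made up for illustration only.
csv_text = "Avis A,2020-03-01,syk\nAvis B,2020-03-02,fast\n"
tsv_text = "Avis A\t2020-03-01\tsyk\nAvis B\t2020-03-02\tfast\n"

# io.StringIO lets read_csv treat a string as if it were a file.
from_csv = pd.read_csv(io.StringIO(csv_text), sep=',', header=None)
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t', header=None)

# Check whether both delimiters produced the same table.
print(from_csv.shape)
print(from_csv.equals(from_tsv))
```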

Based on data from a real research project, this notebook shows how to:

1.  Make a Pandas dataframe from a CSV (or TSV) file on the web
2.  Make a series by taking a column from a dataframe
3.  Sort a series
4.  Select rows based on matching values
5.  Count and plot values.

We will use an online dataset which contains a list of new Norwegian compounds with *korona-* or *corona-*, such as *koronasyk* or *coronavirusfamilien*, obtained by searching online Norwegian newspapers accessible through [CLARINO](https://clarino.uib.no/iness). The dataset has three columns:

1. The name of the newspaper in which the compound first occurred
2. The date of first occurrence
3. The second part of the compound (after *korona-* or *corona-*).

More information at https://github.com/clarino/corona/tree/master/corona21.

---

## Read a dataframe from CSV on the web

In [None]:
import pandas as pd
corona_url = 'https://raw.githubusercontent.com/clarino/corona/master/corona21/korona-compounds-forms.csv'

We read this online dataset into a Pandas dataframe. The separator is a *tab* character. This dataset does not contain a header (i.e. a line with column names) so we have to name the columns explicitly.

In [None]:
occurrences = pd.read_csv(corona_url, sep='\t', header=None)
occurrences.columns = ['source', 'date', 'word']
occurrences
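As an alternative, the column names can be supplied while reading, using the `names` parameter of `read_csv`. The following is a minimal sketch on a made-up two-row sample with the same three-column layout (the sample values are hypothetical), so that it runs without network access.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated, three-column layout.
sample = "Avis A\t2020-03-01\tsyk\nAvis B\t2020-03-02\tfast\n"

# names= assigns the column names while the data is being read.
demo = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                   names=['source', 'date', 'word'])
print(demo.columns.tolist())
print(len(demo))
```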

## Sort series from column

We can now perform various operations on this dataset. For instance, we can extract only the `'word'` column. Taking a single column results in a *series*, which is a one-dimensional data structure with axis labels.

In [None]:
words = occurrences['word']
words

In [None]:
type(words)

The series can be sorted if its values are mutually comparable. Strings are sorted alphabetically (by character code, so uppercase letters come before lowercase ones) and numbers are sorted numerically. If the values are a mix of types, such as strings and numbers, there is no meaningful order and sorting raises a `TypeError`.

In [None]:
words_sorted = words.sort_values()
words_sorted
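To illustrate the remark about mixed types, here is a small sketch with a made-up series (not from the dataset) that mixes strings and an integer. Converting all values to strings first is one possible workaround, shown here only as an assumption about what might be wanted.

```python
import pandas as pd

# Made-up values of mixed types: two strings and an integer.
mixed = pd.Series(['syk', 3, 'fast'])

try:
    mixed.sort_values()
    sort_failed = False
except TypeError as err:
    # Strings and integers cannot be compared with each other.
    sort_failed = True
    print('Cannot sort:', err)

# Converting everything to strings makes the values comparable.
as_strings = mixed.astype(str).sort_values()
print(as_strings.tolist())
```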

The series can also be sorted by other criteria, such as the length of the values if they are strings. To do this, we pass a *lambda function* as the `key` argument to specify the sorting criterion. This function receives the whole series `x`; the `.str` accessor gives access to string methods that are applied to every value, and `.len()` returns the length of each string.

In [None]:
words_sorted = words.sort_values(key=lambda x: x.str.len())
words_sorted
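The following sketch, on three made-up words, shows what the key function computes and how the order can be reversed with `ascending=False`. The names `demo_words`, `lengths` and `by_length_desc` are introduced here for illustration only.

```python
import pandas as pd

demo_words = pd.Series(['virusfamilien', 'syk', 'fast'])  # made-up examples

# The key function receives the whole series and returns one sort key per value.
lengths = demo_words.str.len()
print(lengths.tolist())

# ascending=False reverses the order, so the longest value comes first.
by_length_desc = demo_words.sort_values(key=lambda s: s.str.len(), ascending=False)
print(by_length_desc.tolist())
```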

Note that each value still has its original index label. Suppose we want to take the shortest word, which is now at the top because sorting is ascending by default. The label `0` still refers to the original first row, so both `words_sorted[0]` and `words_sorted.loc[0]` return the word from that row. To take the first row in the new ordering, we have to use the positional `.iloc[0]` instead.

In [None]:
print(words_sorted[0])
print(words_sorted.loc[0])
print(words_sorted.iloc[0])
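The difference between label-based and position-based access may be easier to see on a tiny made-up series, where the index labels after sorting can be inspected directly. This is only a sketch; the values are hypothetical.

```python
import pandas as pd

demo = pd.Series(['virusfamilien', 'syk', 'fast'])   # labels 0, 1, 2
demo_sorted = demo.sort_values(key=lambda s: s.str.len())

print(demo_sorted.index.tolist())   # the original labels, now in sorted order
print(demo_sorted.loc[0])           # by label: the value that was first originally
print(demo_sorted.iloc[0])          # by position: the first value after sorting
```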

### Exercise

*   How can you obtain the last word in the sorted series? Try.

---

## Match values

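Suppose we want to know which proportion of the new words begin with *virus*. We select only the rows whose value matches a string (or regular expression). Note that `.str.match` anchors the pattern at the beginning of each value; to find *virus* anywhere in a word, `.str.contains` can be used instead.

In [None]:
virus_words = words[words.str.match('virus')] # regex match, anchored at the start of each word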
print(virus_words)
print(len(virus_words)/len(words))
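The difference between matching at the start of a value and matching anywhere in it can be sketched on three made-up words (hypothetical, not from the dataset). The sketch also shows that the mean of a boolean series gives the proportion of `True` values.

```python
import pandas as pd

demo_words = pd.Series(['virusfamilien', 'antivirus', 'syk'])  # made-up examples

starts_with = demo_words.str.match('virus')     # pattern must match at the start
anywhere = demo_words.str.contains('virus')     # pattern may occur anywhere

print(starts_with.tolist())
print(anywhere.tolist())

# The mean of a boolean series is the proportion of True values.
print(anywhere.mean())
```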

We can limit the precision of the proportion in terms of the number of decimals that are printed. There are several ways of doing that. Here we use a *formatted string literal* (also called *f-string*).

In [None]:
print(f'{len(virus_words)/len(words)*100:.2f} percent')
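A few of the other ways are sketched below, using a made-up proportion rather than the one computed from the dataset. All three use only built-in Python features.

```python
proportion = 7 / 52   # made-up numbers, not taken from the dataset

as_fixed = f'{proportion * 100:.2f} percent'   # f-string with two decimals
as_percent = f'{proportion:.2%}'               # the % format type multiplies by 100 itself
as_number = round(proportion * 100, 2)         # round() returns a number, not a string

print(as_fixed)
print(as_percent)
print(as_number)
```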

### Exercise

*   Select only words containing *virus* followed by at least five arbitrary characters. How many are there?

---

## Value counts

In which newspapers were new words first observed? To get an overview, we can count the number of times each newspaper source occurs in the dataset. Pandas has a built-in method for counting values. This results in a frequency list in the form of a series, with the newspapers as index labels and the counts as values, sorted from most to least frequent.

In [None]:
source_counts = occurrences['source'].value_counts()
source_counts
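To see the shape of the result on a small scale, the sketch below counts values in a made-up series of newspaper names (hypothetical, not from the dataset). It also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

demo_sources = pd.Series(['Avis A', 'Avis B', 'Avis A', 'Avis A'])  # made-up names

# The distinct values become the index; the counts become the values.
counts = demo_sources.value_counts()
print(counts.index.tolist(), counts.tolist())

# normalize=True gives relative frequencies that sum to 1.
shares = demo_sources.value_counts(normalize=True)
print(shares['Avis A'])
```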

We can now plot these counts. We could import `matplotlib.pyplot` and build the chart ourselves, but Pandas has a simpler `plot` method, which uses Matplotlib behind the scenes.

In [None]:
source_counts.plot(kind='bar', title="Corona new compound counts per newspaper")

### Exercises

1.  Sort the value counts of the sources in *ascending* order. Check the [documentation for `.value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html). Then plot again.
2.  (optional) Compute the distribution of word lengths and plot.
3.  (optional) Plot the number of new words per date.