# Workflow with corpus data normalization

by Koenraad De Smedt at UiB

---

The starting point for this notebook is the simple hypothesis that certain words are characteristic of male or female teenage speech. Data is obtained from COLT (The Bergen Corpus of London Teenage Language, a transcribed corpus of teenage speech), accessible through [Corpuscle](https://clarino.uib.no/corpuscle), a service in [CLARINO](https://clarino.uib.no/iness).

To complicate matters, the total corpus material for male speakers does not have the same size as that for female speakers. So we cannot compute percentages directly from absolute counts. To compare groups fairly, *weighted* percentages must be obtained that take into account the different group sizes.

A *workflow* with the following steps is described in this notebook.

1.  Compose a query that you can use to search a corpus
2.  Download raw frequencies and read them into Python
3.  Normalize the data to produce weighted percentages
4.  Visualize numerical data in a styled, sorted dataframe and a barplot
5.  Write text, tables and figures to files for inclusion in a LaTeX document.

---

## 0. Querying the corpus

We start by writing two strings of words separated by commas. The first string has words assumed to be characteristic of boys’ speech and the second has words assumed to be characteristic of girls’ speech.

In [None]:
boyswords = 'beat, bloke, cool, crap, football, music'
girlswords = 'cat, clothes, kiss, love, model, phone'


Convert the strings to lists and concatenate those.

In [None]:
gwordlist = girlswords.split(', ')
bwordlist = boyswords.split(', ')
allwords = gwordlist + bwordlist
print(allwords)
print(len(allwords))

Join the words with `|` to formulate a query.

In [None]:
query = '|'.join(allwords)
print(query)
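The splitting above relies on every comma being followed by exactly one space. The following is a minimal sketch of a more forgiving variant. `build_query` is a hypothetical helper name that is not used elsewhere in this notebook; it strips stray whitespace and drops repeated words before joining. Whether Corpuscle needs any escaping of special characters is not covered here.

```python
def build_query(*wordstrings):
    """Join comma-separated word strings into one 'a|b|c' query (hypothetical helper)."""
    words = []
    for wordstring in wordstrings:
        for word in wordstring.split(','):
            word = word.strip()             # tolerate missing or extra spaces
            if word and word not in words:  # skip empty items and duplicates
                words.append(word)
    return '|'.join(words)

# Made-up input with irregular spacing and one repeated word
demo_query = build_query('cat,clothes ,  kiss', 'kiss, beat')
print(demo_query)
```

Calling it as `build_query(girlswords, boyswords)` should give the same query as the cell above, since those strings are already regular.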

Copy the query, sign in to [Corpuscle](https://clarino.uib.no/corpuscle) (and accept the License if you have not done it before), select only the ICAME collection and the COLT corpus, paste the query in *Search expression* and run it. Observe the Concordance.

Then compute the Distribution of *gender* relative to *word*, ignoring case and with Type: *absolute*. Notice that some occurrences in the corpus lack a value for gender.

Showing *percentages* gives unweighted percentages for each gender and each word. This is, however, not a good basis for comparing gender differences, because the total amount of material is not the same for each gender group. We therefore need to adjust the numbers in Python.

Download the *count* values. This will give you a file like `distribution.txt`, which has tab-separated counts for each word. Put this file in your Google Drive folder, or upload it to Colab session storage, and inspect it. Note that the first line is a comment, the second is a header with column names (but the name of the last column is missing) and the third line has column totals (which we don't need).

## 1. Loading the data

Mount Google Drive if that is where you have put the file with the absolute counts downloaded from Corpuscle. Otherwise, skip this step.

In [None]:
# skip this cell if you are running Python locally
from google.colab import drive
drive.mount('/content/drive/')

Change the following to your path and filename. Make sure they exist.

In [None]:
data_path = 'drive/MyDrive/Colab Notebooks/ling123/data/'
data_file = 'distribution.txt'


Read the file with the absolute (raw) counts into a Pandas dataframe. Skip lines with comments. The first non-comment line is by default the header. Also skip the row which contains the totals.

In [None]:
import pandas as pd

rawtable = pd.read_csv(data_path + data_file, comment='#', sep='\t', skiprows=[3])
rawtable

Note that the name of the last column is missing in the file, so Pandas gives it an automatic name. That column has the occurrences without a gender. We are not interested in this column, so we drop it. Setting `inplace=True` changes the current dataframe rather than returning a modified copy.

In [None]:
rawtable.drop(columns=['Unnamed: 4'], inplace=True)
rawtable
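Dropping by the literal name `'Unnamed: 4'` only works as long as the unnamed column is the fifth one. As a hedged alternative, the sketch below filters out every column whose name Pandas generated itself. The sample text is made up to mimic the shape of the download (without the comment and totals lines), and `sample`, `demo` and `named` are illustrative names only.

```python
import io
import pandas as pd

# Made-up sample in the same shape as the download: the last header cell is empty.
sample = "Word\tf\tm\tSum\t\ncat\t10\t5\t16\t1\nbeat\t3\t9\t12\t0\n"
demo = pd.read_csv(io.StringIO(sample), sep='\t')
print(list(demo.columns))   # the empty header cell gets an automatic 'Unnamed: ...' name

# Keep only the columns whose header was present in the file.
named = demo.loc[:, ~demo.columns.str.startswith('Unnamed')]
print(list(named.columns))
```

This variant returns a new dataframe instead of changing `demo` in place.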

We use the `Word` column as the index. The result is a *crosstable* with word labels on rows and gender labels on columns.

In [None]:
rawtable.set_index('Word', inplace=True)
rawtable

Compute the sum of all observations.

In [None]:
obs = rawtable['Sum'].sum()
obs
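Before normalizing, it can be worth checking that the table was read as intended. The sketch below uses made-up counts in the same layout as `rawtable`; `demo_raw` and `demo_words` are hypothetical stand-ins for `rawtable` and `allwords`. It assumes that `Sum` is never smaller than `f` plus `m`, since `Sum` may also include occurrences without a gender value.

```python
import pandas as pd

# Made-up counts in the same layout as rawtable (index: Word; columns: f, m, Sum).
demo_raw = pd.DataFrame(
    {'f': [10, 3], 'm': [5, 9], 'Sum': [16, 12]},
    index=pd.Index(['cat', 'beat'], name='Word'))
demo_words = ['cat', 'beat', 'kiss']

# Query words that have no row in the table.
missing = [w for w in demo_words if w not in demo_raw.index]

# Rows where f + m exceeds Sum would point to a misread file.
inconsistent = demo_raw[demo_raw['f'] + demo_raw['m'] > demo_raw['Sum']]

print('missing words:', missing)
print('inconsistent rows:', len(inconsistent))
```

A word reported as missing may simply have no hits in the corpus, but it may also indicate that a data row was skipped while reading the file.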

## 2. Normalizing towards weighted values

Computing percentages of counts in each row would not take into account the fact that the total amounts of male and female speech in the corpus are not balanced: female speech accounts for 46.934%, male speech for 50.063% of all words in the corpus.

The following first makes a copy of the table, dropping the Sum column, and then adjusts the numbers to what they would be if female and male were each to account for 50%.

In [None]:
normtable = rawtable.drop(columns='Sum')
normtable['f'] = normtable['f'] * 50/46.934
normtable['m'] = normtable['m'] * 50/50.063
normtable

Now compute the percentages by dividing the values in each row by that row's sum and multiplying by 100. This is the main result that we wanted.

In [None]:
normpct = normtable.div(normtable.sum(axis='columns'), axis='index').mul(100)
normpct
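The two normalization steps can be checked by hand on made-up counts. The sketch below wraps them in one function; `weighted_percentages`, `toy` and the `target` parameter are hypothetical names introduced for illustration, while the group shares are the ones given in the text above.

```python
import pandas as pd

def weighted_percentages(counts, shares, target=50.0):
    """Rescale each group column as if it held `target` percent of the corpus,
    then convert every row to percentages (illustrative helper)."""
    scaled = counts.astype(float)            # astype returns a copy
    for group, share in shares.items():
        scaled[group] = scaled[group] * target / share
    return scaled.div(scaled.sum(axis='columns'), axis='index').mul(100)

# Made-up raw counts: wordA is used equally often by both groups.
toy = pd.DataFrame({'f': [30, 10], 'm': [30, 40]},
                   index=pd.Index(['wordA', 'wordB'], name='Word'))
result = weighted_percentages(toy, {'f': 46.934, 'm': 50.063})
print(result.round(1))
```

With equal raw counts, `wordA` comes out at roughly 51.6% for `f` rather than 50%, because the same number of hits was found in a smaller amount of female speech.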

## 3. Sorting, styling and visualizing the data

Differences are hard to spot in an unsorted table of numbers, so we sort and style the dataframe: set the precision, add a caption and add a background gradient.

In [None]:
sortpct = normpct.sort_values(by='f')
styledpct = sortpct.style.format(precision=1)
styledpct.set_caption('Percentages weighted by group sizes.')
styledpct.background_gradient(cmap="Blues", axis=None)
styledpct

Make a stacked bar plot of the sorted percentages.

In [None]:
barplot = sortpct.plot.bar(stacked=True)
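In a stacked plot of percentages, a reference line at 50% can make it easier to see which words lean towards which group. This is only a sketch on made-up percentages; `demo_pct`, the axis label and the legend title are assumptions, not part of the original analysis.

```python
import pandas as pd

# Made-up weighted percentages in the same layout as sortpct.
demo_pct = pd.DataFrame({'f': [30.0, 70.0], 'm': [70.0, 30.0]},
                        index=pd.Index(['beat', 'cat'], name='Word'))

ax = demo_pct.plot.bar(stacked=True)                        # returns a matplotlib Axes
ax.axhline(50, color='black', linestyle='--', linewidth=1)  # the 50% mark
ax.set_ylabel('Weighted percentage')
ax.legend(title='Gender')
```

The same three calls could be applied to `barplot`, since `plot.bar` returns a matplotlib Axes object there as well.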

## 4. Writing text, tables and figures to files

In order to present data in various ways in a paper, we save text, tables and figures to files, so that they can be input into LaTeX. Define your own path; make sure the directory exists.

Also, make a helper function for writing textual data to files in your path.

In [None]:
doc_path = "drive/MyDrive/Colab Notebooks/ling123/doc/"

def write_text(data, file):
  """Write data as text to a file in doc_path."""
  with open(doc_path + file, 'w') as out:
    print(data, file=out)
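The helper above fails if the target directory does not exist. A hedged variant using `pathlib` is sketched below; it creates the directory when needed. `write_text_to` is a hypothetical name, and the demonstration writes to a temporary directory rather than to your Drive.

```python
from pathlib import Path
import tempfile

def write_text_to(directory, data, file):
    """Write data as text to directory/file, creating the directory if needed
    (illustrative variant of write_text)."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    # Like print(), end the text with a newline.
    (target / file).write_text(f'{data}\n', encoding='utf-8')

# Demonstration in a throwaway directory
demo_dir = Path(tempfile.mkdtemp()) / 'doc'
write_text_to(demo_dir, 12, 'nrwords.tex')
content = (demo_dir / 'nrwords.tex').read_text(encoding='utf-8')
print(repr(content))
```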

Write some pieces of textual data to files.

In [None]:
write_text(girlswords, 'girlswords.tex') # assumed girls' words
write_text(boyswords, 'boyswords.tex') # assumed boys' words
write_text(len(allwords), 'nrwords.tex') # number of words
write_text(rawtable['Sum'].sum(), 'nrobs.tex') # number of observations in corpus

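Get the figure of the plot that we kept in the variable `barplot` and save it to a file. Because the file name has no extension, matplotlib saves in its default format (PNG, unless configured otherwise) and appends the extension, giving `sortpct.png`. The `dpi` sets the resolution and `bbox_inches='tight'` trims the margins.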

In [None]:
pic = barplot.get_figure()
pic.savefig(doc_path + 'sortpct', dpi=200, bbox_inches='tight')

Write the dataframe with the absolute counts as a LaTeX table. Add a caption and horizontal rules. Add a label to refer to the table. Position the table on the page.

In [None]:
rawtable.style.to_latex(doc_path+'rawfreqtable.tex',
  caption='Absolute word counts. Sum may include occurrences without value for gender.',
  label='tab:rawfreq', hrules=True, position='hbt', position_float='centering')

Similarly, write the styled dataframe with the normalized percentages as a LaTeX table. This styled dataframe already has a caption, so none is passed here. The `convert_css` option is needed to convert CSS styles (such as colors) to LaTeX-compatible formats.

In [None]:
styledpct.to_latex(doc_path+'normpcttable.tex', convert_css=True,
  label='tab:normpct', hrules=True, position='hbt', position_float='centering')

All that remains is to include the saved files in a LaTeX document. See, for instance, *boysandgirls.pdf* in the Files folder at Mitt UiB.

The figure can be included with `\includegraphics{file}`.
The text and tables can be included with `\input{file}`. You can refer to each table and figure with its label (see earlier notebook). Remember to write `\usepackage{booktabs}` and other necessary packages in the LaTeX preamble, such as `graphicx` for `\includegraphics` and, for the colored cells, `xcolor` with the `table` option.

The advantage of such a workflow is that, once it is set up, it is easy to redo the whole procedure with the same words or different words, or to make minor adjustments in the program.


---

*Acknowledgements*:  The example words were suggested by Erlend Astad Lorentzen. The problem is adapted from an exercise by Knut Hofland.

### Exercises

1.  Choose different words and redo the whole thing. Due to the limited size of the corpus it is recommended to choose rather frequent words.
2.  (optional) Compute the distribution of a different attribute, such as age, instead of gender (see *Menu search* in Corpuscle).
3.  (optional) Write your own LaTeX article and include the result data and figures from your analysis.