### Class 1
This week's classroom instruction will focus on three things:
- **Part 1**: Getting the integration between UCloud and GitHub to work
- **Part 2**: Learning how to set up a virtual environment for Python
- **Part 3**: A few warm-up exercises on string manipulation

#### Part 1: Integration between UCloud and GitHub
First, follow the instructions provided in `ucloud.md`.

Once you have done that, let's look at how to use the repository with course materials. First, log in to https://github.com and go to the repository's link: https://github.com/rbroc/NLP-AU-23. 

##### Forking the repository
Click on `Fork` (top-right of the page). This creates a copy of the repository under your own account, which is partly independent of the original. You will be submitting changes to your fork, not to the original repository.
The original repository will be updated over time with new materials and exercises. You can keep the two in sync by setting both as *remotes* in your local copy of the repository.

##### Cloning the repository and setting up remotes
With the VS Code app on UCloud open, and a Terminal open, you can clone your repository by `cd`-ing into your private folder, and running: 
```
git clone https://github.com/your_username/NLP-AU-23
```

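You can list your remotes by typing `git remote -v`. Your fork will already be listed as `origin`.
You should also add my repository as a remote, replacing `[robertas_remote_name]` with a name of your choice (`upstream` is a common convention), by running: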
```
git remote add [robertas_remote_name] https://github.com/rbroc/NLP-AU-23.git
```

##### Our workflow
As my repository gets updated, you can pull it into yours by running:
```
git pull [robertas_remote_name] [branch_name]
```

You can push to your remote by typing:
```
git push origin [branch_name]
```

Once you have done this, our workflow will be the following:
- You run the GitHub setup script (and log in with your token if needed)
- You go to your local repo
- You pull my repo
- You make changes and push to your fork

#### Part 2: Virtual environments
A good way to manage dependencies for specific projects and avoid conflicts is to create virtual environments: bundles of libraries that you can "activate" and "deactivate" at will, without interfering with your global environment. I tend to keep all my virtual environments in one place, e.g., my private folder on UCloud.

##### Creating, activating, and deactivating a virtual environment
Let's create a virtual environment.
1. Navigate to your private folder (`path_to_folder`)
2. Create a virtual environment folder, e.g., by running `mkdir venvs`
3. Create a new virtual environment (`nlp-e23`), by running:
`python -m venv path_to_folder/venvs/nlp-e23`
4. You can activate it by running: `source path_to_folder/venvs/nlp-e23/bin/activate`
5. To deactivate it, you can simply run: `deactivate`

Anything you `pip install` while the environment is active stays inside the environment.
To make sure you have the latest pip, run:
```
pip install --upgrade pip
```
The nice thing about this is that you don't need to reinstall your libraries every time you open UCloud: you just activate your virtual environment.

##### Using it in a notebook
Everything you install while the virtual environment is active is fully contained inside the virtual environment.
If your virtual environment is active, scripts will use it, but an extra step is needed to use it in notebooks.
First, we need to install ipykernel:
```
pip install ipykernel
```

Then, we need to create a new kernel using our virtual environment:
```
python3 -m ipykernel install --user --name=nlp-e23
```

Now try to open a notebook (you can create an empty one by running `touch notebook_name.ipynb`). On the top right, you can choose a kernel. A kernel based on the virtual environment will be available.
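
To confirm that a notebook (or script) is really running from the virtual environment, one option is to inspect the interpreter from Python itself. The snippet below is a minimal sketch: the helper names `in_virtualenv` and `active_env_name` are made up for illustration, and it assumes the environment was created with `python -m venv` as described above.

```python
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def active_env_name() -> str:
    """Return the name of the folder that sys.prefix points to."""
    return Path(sys.prefix).name


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
print("Environment folder name:", active_env_name())
```

If the kernel was picked correctly, the environment folder name should match the name you gave your environment (here, `nlp-e23`).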

##### Installing libraries from a requirements file
An easy way to keep track of your dependencies is to list them in a requirements file. Such a file is generally called something like `requirements.txt`, and it is simply a list of the libraries your project needs.
The content of the text file looks something like this:
```
datasets==2.12.0
pandas==1.5.3
```
You will have one library per line, and -- although it is not obligatory -- you can pin the package to the version you know is needed for your project.
You can install all the dependencies needed from your requirements file, by running:

`pip install -r requirements.txt`
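
To make the file format concrete, here is a small sketch that reads requirement lines into a dictionary. The function `parse_requirements` is a hypothetical helper, not part of pip, and it deliberately handles only the `name==version` form shown above (pip itself accepts many more specifiers).

```python
from typing import Dict, Optional


def parse_requirements(text: str) -> Dict[str, Optional[str]]:
    """Map each package name to its pinned version (None if unpinned)."""
    requirements = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip empty lines
        # partition returns (before, separator, after); separator is "" if absent
        name, sep, version = line.partition("==")
        requirements[name.strip()] = version.strip() if sep else None
    return requirements


example = """\
datasets==2.12.0
pandas==1.5.3
"""
print(parse_requirements(example))
```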


#### Part 3: Warm-up exercises
What follows are a few exercises to get back in shape with Python syntax, especially simple string manipulation.

##### String concatenation
Write a function that takes two strings as input, and returns a concatenated string (the second string is appended to the end of the first one).

```python
def concat_two_strings():
    ''' Add code '''
    pass
```

Now write a function that can take any number of strings, and concatenate all of them. Hint: you can use a different data structure to pass the strings (e.g., an iterable), or use `*args` to pass any number of arguments to the function call.

```python
def concat_any_string():
    ''' Add code '''
    pass
```
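
As a reminder of how `*args` behaves, here is a short sketch on a different task (so the exercise is still yours to solve). The function name `total_length` is made up for illustration.

```python
def total_length(*strings):
    """Return the combined length of all strings passed as positional arguments."""
    # Inside the function, `strings` is a tuple holding every positional argument.
    return sum(len(s) for s in strings)


print(total_length("natural", "language"))  # two separate arguments

words = ["a", "bc", "def"]
print(total_length(*words))  # the * unpacks a list into separate arguments
```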

##### Counting substrings
Write a function that takes two inputs: 
- l, which is a single character;
- s, which is a string. 

The function should return how many times l is contained in s (an integer).

```python
def letter_count(l, s):
    ''' Add code '''
    pass
```

Can you write a function that generalizes to `l` being any sequence of characters (e.g., counting how many times the substring `ab` occurs)?
Hint: you can do it in plain Python, or use functions (e.g., `findall`) from the regex module `re`.
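
If you have not used `re.findall` before, the sketch below shows what it returns, using toy strings unrelated to the exercise. Note that matches are non-overlapping, and that `re.escape` is useful when the text you search for contains characters with a special meaning in regex.

```python
import re

text = "a1b22c333"

# findall returns every non-overlapping match of the pattern, left to right.
numbers = re.findall(r"\d+", text)
print(numbers)
print(len(numbers))

# re.escape makes a literal string safe to use as a pattern,
# which matters when it contains characters such as "." or "*".
dots = re.findall(re.escape("."), "v1.2.3")
print(len(dots))
```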

Now edit the previous function so that it returns counts for multiple substrings (make *l* an iterable, e.g., a list).
The output should be a dictionary whose keys are the elements of *l* and whose values are the respective counts in *s*.

```python
def generalized_counter(l, s):
    ''' Add code '''
    pass
```

##### Loading text files

The `data` folder contains a file called `book-war-and-peace.txt`.
Use the functions `open` and `read` to read the file as a single string, then split it into a list of words based on whitespace.
Once you have done this, try using `readlines` instead: what does it do?

```python
# read book-war-and-peace.txt as a string or using readlines
```
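
The general pattern for reading a file can be sketched on a tiny stand-in file. The file name `sample.txt` and its content are invented for this example; swap in the path to the book when you do the exercise.

```python
import tempfile
from pathlib import Path

# Write a tiny stand-in file so the example does not depend on the course data.
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.txt"
sample_path.write_text("first line of text\nsecond line\n", encoding="utf-8")

# The `with` block closes the file automatically when it ends.
with open(sample_path, encoding="utf-8") as f:
    text = f.read()  # the whole file as one string

words = text.split()  # split() with no argument splits on any whitespace
print(words)
```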

##### Regular expressions

After reading the file with `read`, first replace all newlines (`\n`) with spaces, then collapse any sequence of multiple spaces into a single one, and finally remove all non-alphanumeric characters (keeping spaces). Hint: you can use functions like `re.sub` to do so (here is a guide to regex patterns: https://docs.python.org/3/howto/regex.html).

```python
# Remove \n and non-alphanumeric characters
```
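
As a refresher on `re.sub`, here is a sketch on a toy string with patterns that differ from the ones the exercise needs. It shows two ingredients you may find useful: the `+` quantifier for runs of a character, and a negated character class.

```python
import re

raw = "a--b---c,  item 42!"

# Collapse every run of one or more dashes into a single dash.
collapsed = re.sub(r"-+", "-", raw)
print(collapsed)

# A negated character class, [^...], matches anything NOT listed inside it.
# Here every character that is not a digit is removed.
digits_only = re.sub(r"[^0-9]", "", raw)
print(digits_only)
```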

Now lowercase this text, split by whitespace, and count the number of occurrences for each word. 
Hint: 
- import `Counter` (`from collections import Counter`) to extract counts;
- transform the output of `Counter` into a dictionary (pass its output to `dict`);
- import pandas and pass that dictionary to `pd.DataFrame.from_dict` (set `orient="index"`);
- sort it using `sort_values`.
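
The first two hints can be sketched on a toy list of tokens (invented for this example). The pandas steps are left out here, so this only shows what `Counter` produces before you hand it to a DataFrame.

```python
from collections import Counter

tokens = ["the", "cat", "saw", "the", "dog", "the", "cat"]

counts = Counter(tokens)  # maps each token to its number of occurrences
as_dict = dict(counts)    # a plain dictionary with the same content
print(as_dict)

# most_common(n) returns the n most frequent items as (token, count) pairs.
print(counts.most_common(2))
```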


Visualize the top 100 words: what do you notice?

```python
# add code
```
