# Lab 2

**Goal**: In this lab we will learn how to import our own data into Python, build networks, and explore how our modeling choices can affect the analysis and interpretation of the data.

**Before we get started**:
1. In GitHub Desktop, make a new branch titled `comp-activity-2` for your repository `BIOL4559homework`.
2. Put this notebook in the repository.
3. Activate your conda environment by running `conda activate biol4559` and run jupyter using `jupyter notebook`.

**After the completion of this lab**
1. List the people with whom you worked on this lab.
2. Make a pull request and title it "Grade Computational Activity 2".
3. Assign @AbhayGupta115 as a reviewer.
4. Submit the pull request.

**How you will be graded**

This activity is out of 10 points. You will be graded on three things: completion, clear documentation, and correctness of your process.

* *Completion (4pts)*: This is solely whether you attempted and completed all portions of the computational assignment.
* *Documentation (4pts)*: Write comments above your code describing what it does, to demonstrate that you understand what the code is doing. Examples of good and bad documentation are given below.
* *Correctness (2pts)*: Does your code run and produce the correct output?


# Polars

The two main libraries used for data analysis in Python are pandas and polars. Pandas is the more widely used library, but polars is generally faster and more memory-efficient on larger datasets. Here we will use polars to import and prepare our data for use in networkx.

Let's start by importing the necessary libraries.

To import data using polars, we use the `read_csv` function. This function takes in the file path of the data and returns a polars dataframe. Here we will import the beetle data (cook_social_2020.csv) provided to you.

Now that we have our data imported, we can start to explore it. The beetle data contains observations of social interactions between beetles. Each row in the dataframe represents an interaction between two beetles, identified by their `focal_id` and `partner_id`. The `condo` column indicates the location of the interaction, and the `datetime` column indicates when the interaction occurred.

For now, the `datetime` column is stored as strings. We need to convert it to a datetime type so that we can filter the data by date. We can do this using the `str.to_datetime` expression method. We then replace the existing `datetime` column with the parsed one.

Our data also contains information about unknown beetles, identified by the `focal_id` and `partner_id` values of 'UK', 'UKM', and 'UKF' (look at the ReadMe file). We will filter these out of our data using the `filter` function. This function takes in a boolean (True/False) expression and returns a new dataframe with only the rows that satisfy the expression. We can use bitwise operators to combine multiple boolean expressions as well.

Bitwise operators are used to combine multiple boolean expressions. The most common bitwise operators are:
* & (and): Returns True if both expressions are True
* | (or): Returns True if either expression is True
* ~ (not): Returns True if the expression is False



Now that we have cleaned our data, we can start choosing which data we want to use to build our networks. For this lab, we will focus on the interactions that took place in condo '6B' during the month of July 2018. We can filter our data using the `filter` function again.

We can also do the above in a single `filter` call by combining the conditions with bitwise operators.

# Creating Networks

Now that we have our data filtered, we can start to build our networks. We will use the networkx library to build our networks. The first step is to create an empty graph using the `Graph` function (Just like in Lab 1).

Now we will add nodes to our graph. We will add all the unique `focal_id` values as nodes in our graph. We can do this using the `add_nodes_from` function to add all the nodes at once.

Now we will add edges to our graph. We will add an edge between two nodes if there is an interaction between them. We can do this using the `add_edges_from` function to add all the edges at once. For now we will add edges for all types of interactions. We will filter for specific types of interactions later. We use `rows()` to convert the polars dataframe to a list of tuples, which is the format that the `add_edges_from` function expects.

An alternative is the `iter_rows` function, which returns an iterator over the rows of the dataframe. This is more memory efficient than `rows()`, because it avoids building the entire list of tuples at once.

We can now draw the network using the `draw` function. This function takes in the graph and some optional parameters to customize the appearance of the graph.
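A small drawing example is sketched here on a made-up four-node graph. The colours, sizes, layout, and seed are arbitrary choices for illustration. In a notebook the figure is displayed below the cell.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph: a triangle (A1, B2, C3) with one extra beetle attached to C3.
G = nx.Graph()
G.add_edges_from([("A1", "B2"), ("B2", "C3"), ("A1", "C3"), ("C3", "D4")])

# Fix the layout seed so the picture looks the same every time the cell is run.
pos = nx.spring_layout(G, seed=42)

fig, ax = plt.subplots(figsize=(5, 4))
nx.draw(
    G,
    pos=pos,
    ax=ax,
    with_labels=True,        # write the beetle ID on each node
    node_color="lightblue",
    node_size=600,
    edge_color="gray",
    font_size=9,
)
```

Computing `pos` once and reusing it is handy when you want to draw several networks of the same beetles with nodes in the same positions.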

Let's print some statistics about our network, like the mean degree, clustering coefficient, and density of the network.
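One way to compute these three numbers is sketched below. The helper name `network_stats` is hypothetical (it is not a networkx function), and the graph is a made-up toy example.

```python
import networkx as nx

def network_stats(G):
    """Return the mean degree, average clustering coefficient, and density of G."""
    n_nodes = G.number_of_nodes()
    # G.degree() yields (node, degree) pairs; average the degrees over all nodes.
    mean_degree = sum(degree for _, degree in G.degree()) / n_nodes
    return {
        "mean_degree": mean_degree,
        "clustering": nx.average_clustering(G),  # mean of the local clustering coefficients
        "density": nx.density(G),                # observed edges / possible edges
    }

# Toy graph: a triangle (A1, B2, C3) with one extra beetle attached to C3.
G = nx.Graph()
G.add_edges_from([("A1", "B2"), ("B2", "C3"), ("A1", "C3"), ("C3", "D4")])

stats = network_stats(G)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Wrapping the calculation in a function makes it easy to rerun on every network you build later in the lab.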

Seems like our network is very well connected. This is likely because we included all types of interactions in our network. Let's try filtering for specific types of interactions and see how that affects the network. Let's start by filtering for only "Touch Partners" interactions.

Let's reuse the previous code to print out the new statistics for this network.

Now the network is much less connected. On average it seems like each beetle only touches 5 other beetles. This is likely a more accurate representation of the social structure of the beetles. We can also try filtering for other types of interactions and see how that affects the network.

Then we can start having some fun and try to figure out whether there is any correlation between the different types of interactions. For example, do beetles that touch each other also tend to be mates? We can use the Jaccard index to calculate the similarity between the two networks. The Jaccard index is defined as the number of edges in the intersection of the two networks (i.e. present in both networks) divided by the number of edges in the union of the two networks (i.e. all the edges present in either network).

Here is an example of how to calculate the Jaccard index using two example networks.
```python
import networkx as nx

# Creating network 1
G1 = nx.Graph()
G1.add_nodes_from([1, 2, 3, 4, 5]) # Adding nodes
G1.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1)]) # Adding edges

# Creating network 2
G2 = nx.Graph()
G2.add_nodes_from([1, 2, 3, 4, 5]) # Adding nodes
G2.add_edges_from([(2, 3), (3, 4), (4, 5)]) # Adding edges

# Calculating Jaccard index
# Store each undirected edge as a frozenset so that (u, v) and (v, u) compare as equal
edges_G1 = {frozenset(edge) for edge in G1.edges()}
edges_G2 = {frozenset(edge) for edge in G2.edges()}

intersection_size = len(edges_G1.intersection(edges_G2)) # Number of edges in both networks
union_size = len(edges_G1.union(edges_G2)) # Number of edges in either network
jaccard_index = intersection_size / union_size

print(f'Jaccard index: {jaccard_index}')
```

## Graded Activity

* Load the data from CookSocial2020 dataset and clean the data to remove any bad values (like unknown males, females).
* Create a network for the year 2020, and a condo of your choosing (mention which condo you chose).
* Draw the network for different mating interactions (Mating Partners, Touch Partners, and 5 CM Partners), and try to remove any isolates. Tell us what you think: should isolates be considered when analysing networks, and why or why not? (Hint: Check the documentation for networkx to see how to remove isolates).
* Check how the network statistics (degree, clustering coefficient, and density) change with different mating interactions.
* What do these statistics tell you?
* What would these different representations of the same data be useful for? State applications for each type of network you draw.

(Optional, Ungraded activity) Check whether different interactions are correlated. For example, does a beetle with more Touch partners also get more Mates? (Hint: Calculate the Jaccard similarity between the edges of the two networks).

**Student answer here**