# FSDS24 - Week 1, Lab 3: Merging and reshaping data

In this lab, we will use two files from the Movie Stack Exchange: the original `'movie_stack_df.feather'` and now also `'movie_stack_df_users.feather'`. Depending on your approach to questions 1-3, you may not need the Users data. Both should be available on Canvas.

As a reminder, you can see the schema for this data at the following URL:
https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

This lab will also be the first formative exercise to be submitted for feedback. The details about the deadline will be in the 'assignment' on Canvas. Provisionally, this is due on Wednesday at 5pm. This allows for multiple sessions with the TAs in order to explore and understand your answer to this question. 

In this lab you will want to merge the posts data with the users data in such a way that you can make some statements about the users that you would not be able to make otherwise. Depending on your skill set with managing data, you may want to scale your ambitions accordingly. However, with some assistance from the web, chat agents, and your peers, you should be able to draft your own code in order to make some meaningful claims.
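
As a minimal sketch of what such a merge can look like, the example below uses two toy tables rather than the real feather files. The column names `OwnerUserId` (on posts) and `Id` (on users) follow the public schema linked above; `Reputation` and the renamed `UserId` column are only illustrative choices.

```python
import pandas as pd

# Toy stand-ins for the real tables (illustrative values only).
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PostTypeId": [1, 2, 2, 1],
    "OwnerUserId": [10, 11, 10, 12],
})
users = pd.DataFrame({
    "Id": [10, 11, 13],
    "Reputation": [150, 20, 1],
})

# A left merge keeps every post, even when its owner is missing from users.
# Renaming the user key avoids ambiguous Id_x / Id_y columns.
# validate="many_to_one" raises an error if a user Id appears more than once.
merged = posts.merge(
    users.rename(columns={"Id": "UserId"}),
    how="left",
    left_on="OwnerUserId",
    right_on="UserId",
    validate="many_to_one",
    indicator=True,
)

print(merged[["Id", "OwnerUserId", "Reputation", "_merge"]])
```

The `_merge` column added by `indicator=True` is a quick way to see how many posts found no matching user, which is worth reporting as part of your inclusion / exclusion criteria.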

For this exercise you will want to display skills in merging, filtering, and aggregation as well as give due regard to operationalisation. We will introduce time series data next week so we should try to consider approaches that do not rely heavily on time series operations. That said, you should be able to filter the data by datetime with little difficulty should you want to explore relationships over time. 
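
Filtering by datetime might look something like the sketch below. It assumes a `CreationDate` column as listed in the schema, and uses toy values; in the real file the column may already be stored as a datetime, in which case the parsing step is harmless.

```python
import pandas as pd

# Toy posts with ISO-formatted creation dates (illustrative values only).
posts = pd.DataFrame({
    "Id": [1, 2, 3],
    "CreationDate": [
        "2011-11-30T19:15:54",
        "2015-06-01T08:00:00",
        "2019-01-20T12:30:00",
    ],
})

# Parse once, then filter with ordinary comparisons.
posts["CreationDate"] = pd.to_datetime(posts["CreationDate"])

# Everything from 2015 onwards.
since_2015 = posts[posts["CreationDate"] >= pd.Timestamp("2015-01-01")]

# Only posts created during 2015, via the .dt accessor.
only_2015 = posts[posts["CreationDate"].dt.year == 2015]

print(since_2015)
print(only_2015)
```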

# Exercise 1. Considering units of analysis: Posts vs Users 

In the posts table, each post is a unit of analysis; i.e. a row in our data. Yet, the posts were made by user accounts. We can ask questions of both, either independently or together, where appropriate. In this first exercise we want you to describe a research question that focuses on specific units of analysis.

**First** construct a research question (that can be considered with this data) where our unit of analysis is the post. 

Some examples: 
- Comparative: Which is longer: a post or an answer?  
- Correlational: Do posts with more words lead to answers with more words? 

**Second** construct a research question where the unit of analysis is the user. 

Some examples:
- Comparative: Do users with a website URL answer posts more often?
- Correlational: Do users who write many posts also answer many posts?



## Answer 1. 

RQ1. ...

RQ2. ...

### Evaluator's comments 
 > These will be terse and focus on whether the unit of analysis is correct and whether the other elements make sense as something to be explored in this data.

^^^ 

# Question 2. Operationalisation

For only one of these two research questions describe how you will operationalise your concepts.

Consider: 
- Inclusion / exclusion criteria;
- How you will establish your unit of analysis derived from the data. For example, for 'longer' posts does this include URLs? Is it by words or by characters? Would that make a difference worth articulating?
- What sort of approach might help you establish a statistical difference?


## Answer 2. 

...

### Evaluator's comments 
 > These will be terse and focus on whether the details used to clarify the operationalisation are clear enough that one can see how this could be accomplished with the SE data.

 ^^^

# Question 3. Performing an operationalisation 

You are welcome to simply answer the research question if you are curious, but here we have a more modest goal: perform an operationalisation on the data. That is, take one of the proposed measures and get the data in a form where you can provide descriptive statistics about that concept as a variable. For example, for word length you would first count the words per post, which would be either its own column in the DataFrame or its own Series and then plot the distribution of word length. Document any meaningful steps you had to take. In the word length case, we might document how we split the posts into words. 
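
To illustrate the word length case, here is one possible sketch on toy data. The helper `count_words` is a hypothetical name, and stripping tags with a regular expression is a crude assumption about how to treat the HTML in `Body`; a different, equally defensible operationalisation might count URLs or code differently.

```python
import re
import pandas as pd

# Toy posts; Body holds HTML as in the SE dump (illustrative values only).
posts = pd.DataFrame({
    "Id": [1, 2, 3],
    "Body": [
        "<p>Why does the hero leave?</p>",
        "<p>See <a href='https://example.com'>this link</a> for details.</p>",
        None,
    ],
})

def count_words(html):
    """Count whitespace-separated tokens after removing HTML tags."""
    if not isinstance(html, str):
        return 0  # treat missing bodies as zero words
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal, not a full parser
    return len(text.split())

posts["word_count"] = posts["Body"].apply(count_words)

# Descriptive statistics for the new variable.
print(posts["word_count"].describe())
```

Note the documented choices here: the link URL sits inside a tag and so is not counted, while the link text is. With matplotlib installed, `posts["word_count"].hist()` would plot the distribution.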

In [1]:
## Answer 3. 

...

### Evaluator's comments 
 > These will focus on the clarity of the description, the extent to which the description matches both the data and the concept, and the clarity of any visualisation of the operationalised feature. Consider discussing the code in terms of FREE.

^^^

# Question 4. "Ackshually": Partitioning and Aggregating data 

This is a guided analysis of this data which may or may not be directly useful for your specific research question, but will include several interesting steps involving merging and aggregation. This particular analysis will make use of the fact that comments are threaded, meaning that they do not simply have a post associated with them, but potentially also another comment to which they are replying. 

There is now an internet meme about people who like to chime in and correct others, often presumably starting their comment with "Actually, ...". The meme is typified on knowyourmeme with the standard tropes of the neck beard, etc. See: https://knowyourmeme.com/memes/ackchyually-actually-guy

In computational social science we can think about structured communication as a series of roles. (See https://www.cmu.edu/joss/content/articles/volume8/Welser/ for an example of how others have operationalised roles). 

Here we will operationalise roles into 4 different signatures. One of these signatures we may associate with the "Ackshually" meme. To identify these signatures we will need to merge the data...with itself. 

Below I describe the four roles we will identify. These should be mutually exclusive. 

1. People who never created any content but have a row in the user_df.
2. People who _only_ create a post but never an answer. 
3. People who only create an answer. 
4. People who create both posts and answers. 

Now here's where it gets tricky: 
I want you to separate out role number 3 into:

- 3.1. People who only create an answer that is a reply to another answer.
- 3.2. Everyone else who only creates an answer.

Then let's find out if people in 3.1 are more prone to using the word "actually". As in, these are the people who never ask questions or even provide a useful answer at first, but swoop in to correct someone else's answer. 

Below I explain why this requires you to merge the data in with itself:
- Each post has an "`Id`" column indicating its unique index in the data. 
- Each post has a "`PostTypeId`" to delineate whether it is a question or an answer. 
- Each post has a "`ParentId`" column indicating (if it is a reply) what post it is in reply to. If there is an `'Id'` of 3 for one post, and a `'ParentId'` of 3 for another post, that means that the second post is a reply to the first one.

You will need to get the `PostTypeId` for the _parent_ of every answer and merge it in with that answer. Then, if the `PostTypeId` of the parent is also 'answer', this post is a "reply to an answer". It is a one-to-many merge because you are merging the one `PostTypeId` of the parent into the many answers.
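
The self-merge can be sketched on a toy table as follows. This assumes the schema's coding of `PostTypeId` (1 for a question, 2 for an answer); the column names `ParentPostTypeId` and `reply_to_answer` are made up for illustration, and post 4 replying to an answer is a constructed case that may not occur in the real data.

```python
import pandas as pd

# Toy posts: 1 is a question, 2 and 3 answer it, 4 replies to answer 2.
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PostTypeId": [1, 2, 2, 2],
    "ParentId": [None, 1, 1, 2],
})

# Lookup table: one row per post, renamed so it describes the *parent*.
parents = posts[["Id", "PostTypeId"]].rename(
    columns={"Id": "ParentId", "PostTypeId": "ParentPostTypeId"}
)

# Keep only answers. The toy data has no answers with a missing ParentId,
# so the integer cast is safe here; real data may need a dropna first.
answers = posts[posts["PostTypeId"] == 2].astype({"ParentId": "int64"})

# Many answers can share one parent, hence many_to_one from the answers' side.
answers = answers.merge(parents, how="left", on="ParentId", validate="many_to_one")

# An answer whose parent is itself an answer is a "reply to an answer".
answers["reply_to_answer"] = answers["ParentPostTypeId"] == 2

print(answers)
```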

Then you have to mark in a separate column the mentions of "actually".  

Finally, you have to split the data into the roles above. There are many possible approaches to this task. Mine would be to mark each post as "question", "answer to a question", or "answer to another answer". Then I would sum these per user. Then I would find the roles in this aggregated user data set:

Pseudocode: 
- if (count of "question" == 0) and (count "a" == 0) and (count "aa" == 0): group_1
- elif (count of "question" > 0) and (count "a" == 0) and (count "aa" == 0): group_2
- elif (count of "question" > 0) and ((count "a" > 0) or (count "aa" > 0)): group_4
- elif (count of "question" == 0) and (count "a" == 0) and (count "aa" > 0): group_3_1
- elif (count of "question" == 0) and ((count "a" > 0) or (count "aa" > 0)): group_3_2
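
One way to turn the pseudocode above into Python is sketched below. The function name `assign_role`, the `kind` column, and the toy user Ids are all hypothetical; the logic assumes each post has already been labelled "question", "a", or "aa".

```python
import pandas as pd

def assign_role(n_question, n_a, n_aa):
    """Map per-user post counts to a role label, following the pseudocode."""
    if n_question == 0 and n_a == 0 and n_aa == 0:
        return "group_1"
    if n_question > 0 and n_a == 0 and n_aa == 0:
        return "group_2"
    if n_question > 0:
        return "group_4"    # questions plus at least one kind of answer
    if n_a == 0:
        return "group_3_1"  # only answers to other answers
    return "group_3_2"      # every other answer-only user

# Toy labelled posts (illustrative values only).
labelled = pd.DataFrame({
    "OwnerUserId": [10, 10, 11, 12, 12, 13],
    "kind": ["question", "a", "question", "aa", "aa", "a"],
})

# Count each kind of post per user; make sure all three columns exist.
counts = pd.crosstab(labelled["OwnerUserId"], labelled["kind"])
counts = counts.reindex(columns=["question", "a", "aa"], fill_value=0)

# Users with no posts never appear in the posts data, so group_1 only
# shows up after reindexing against the full list of user Ids.
all_user_ids = [10, 11, 12, 13, 14]
counts = counts.reindex(all_user_ids, fill_value=0)

roles = counts.apply(
    lambda row: assign_role(row["question"], row["a"], row["aa"]), axis=1
)
print(roles)
```

Because the branches are checked in order, each user lands in exactly one group, which keeps the roles mutually exclusive.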

Then, once I have these roles, I can merge them back into the posts data and aggregate the mentions of "actually" by role type. Then we simply report the findings. You are encouraged to try a Chi-Square test of independence on the table with groups 2, 3.1, 3.2, and 4. If the test is not significant, we have no evidence that any group is more likely than another to use this word. The test is somewhat sensitive to cells with expected counts below 5, so it may or may not be suitable, but consider some way to assess whether we can say there are more observed than expected instances of mentioning the word.
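
To make "observed versus expected" concrete, the sketch below computes expected counts and the Chi-Square statistic by hand for a table of invented counts. These numbers are purely hypothetical and are not results from the Movie SE data.

```python
# Hypothetical counts, NOT results from the real data.
# Each row is a role; columns are [posts with the word, posts without it].
observed = {
    "group_2":   [30, 970],
    "group_3_1": [4, 46],
    "group_3_2": [60, 1440],
    "group_4":   [90, 2360],
}

row_totals = {g: sum(cells) for g, cells in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Under independence, expected = row total * column total / grand total.
expected = {
    g: [row_totals[g] * ct / grand_total for ct in col_totals]
    for g in observed
}

# Chi-Square statistic: sum over cells of (observed - expected)^2 / expected.
chi2_stat = sum(
    (o - e) ** 2 / e
    for g in observed
    for o, e in zip(observed[g], expected[g])
)
dof = (len(observed) - 1) * (len(col_totals) - 1)

# Flag cells whose expected count is below 5, where the test is less reliable.
small_cells = [(g, e) for g in expected for e in expected[g] if e < 5]

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}")
print("cells with expected count below 5:", small_cells)
```

This stops short of a p-value. If SciPy is available, `scipy.stats.chi2_contingency` takes the observed table and returns the statistic, p-value, degrees of freedom, and expected counts in one call.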

> Please note that this lab uses real data, and I had not completed it myself when setting it. Thus, there might not be any interesting relationship between the groups I describe and the word "actually", or indeed there may not even be any people in one of the groups I will ask you to create. This is not a trick...this is a deductively created exploration of this data set.

In [None]:
# Step 0. Load your data into DataFrames

posts_df = ...

In [None]:
# Step 1. Create an "actually" column. 
# This can be True or False depending on whether the text contains the word "actually". 
# Structure this in such a way that you can just as easily ask for a different word
# This means you should probably create a function and use "check for word" as some parameter, then 
# send check_for_word("actually") so later you can check for another word or phrase.

posts_df["actually"] = ...
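
One possible shape for such a function is sketched below on toy data. The two-argument signature (a Series of texts plus the word) is an assumption for illustration, as is the choice to match whole words case-insensitively, so that "factually" does not count as a mention of "actually".

```python
import re
import pandas as pd

# Toy bodies (illustrative values only), including a missing one.
posts = pd.DataFrame({
    "Body": [
        "Actually, that is wrong.",
        "It was factually correct.",
        None,
        "I actually agree",
    ],
})

def check_for_word(texts, word):
    """Return a boolean Series: True where `word` occurs as a whole word."""
    # re.escape lets the same function handle phrases with special characters.
    pattern = r"\b" + re.escape(word) + r"\b"
    # na=False marks missing text as not containing the word.
    return texts.str.contains(pattern, case=False, regex=True, na=False)

posts["actually"] = check_for_word(posts["Body"], "actually")
print(posts)
```

Swapping in another word or phrase is then a one-line change, for example `check_for_word(posts["Body"], "cinematic")`.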

In [None]:
# Step 2. Identify which rows refer to questions and which refer to answers 

## For this you can use the post schema: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede/326361
## To be more direct you will want to filter the data in some manner based on `PostTypeId`
## You might want separate `is_post` and `is_answer` columns or perhaps just filter `PostTypeId` removing other rows

...

In [None]:
# Step 3. Identify which answer rows reply to an answer or a question
# 
# For this, look to the `ParentId` column. 
# Now here it can get a bit tricky. You will want to merge in the PostTypeId of the parent's row into the child's row.
# Then you can filter the dataset into top-level answers and other answers. 

...

In [None]:
# Step 4. Create role-specific labels 
# See explanation above

...

In [None]:
# Step 5. Aggregate the data into the different roles and report on it.
# See explanation above

...

## Evaluator's comments

> These will focus on the clarity of the code and the plausibility of the result given the code. 

^^^

# Exercise 5. Testing a research hypothesis 

In this part, please reuse your code pipeline from question 4. Instead of using the word "actually", use a word, phrase, or feature of the body text which you think might credibly differ between these four classes of users (not five, since the ones who do not post would not count). For example, in the Movie SE, one might inquire about the use of the word "cinematic". Posit a hypothesis that suggests there is a significant difference in the use of `[feature x]` between these four classes. Then report the table and the Chi-square results. Please try to think of an interesting word/feature and then use that, rather than iterating until you find one that is significant. I understand there is some rationale for exploring different possible words, but try to treat this as a deductive task.

## Answer 5: 

My hypothesis: ... 

My rationale: ...

In [2]:
# Answer 5 Here

...

My explanation of the results: 

## Evaluator's comments: 

> These will focus on the plausibility of the relationship given the code, the clarity of the hypothesis, and the effectiveness of the approach to 'just exploring'. 

^^^