# Warm-up challenges for week 2

Now that we've seen how Python and Jupyter Notebooks work, and you have read about Digital Analytics and Computational Social Science, it's time for you to apply this knowledge. This week has three preparatory challenges.

Each challenge has two components:
1. **Programming and interpretation**
2. **Reflection**

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give all of them a try.
2. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to give it a try).
3. These challenges are ungraded, yet they help you prepare for the graded challenges in the portfolio. If you want to be efficient, have a look at what you need to do for the upcoming graded challenges and see how to combine the work.

### Facing issues? 

We are constantly monitoring the issues on GitHub to help you out. Don't hesitate to log an issue there, clearly explaining the problem and showing the code you are using and any error message you receive.

**Important:** We are only monitoring the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day.


### Using Markdown

1. Make sure to combine code *and* Markdown to answer all questions. In Markdown, state the question (and question number) and your answer, and relate the answer to the code and its output. For the graded challenges, failing to do so will impact the grade, as we will not be able to see whether you answered the question.
2. For every line of code, please include a Markdown cell explaining what the code is expected to do.


## Challenge 1


### Programming challenge

For this challenge we will work with social media data. It can come from any of the sources included on Canvas. You can select a dataset yourself. Load that file using ```pandas``` and:
1. Display the first 5 rows
2. Check which columns the dataframe contains
3. Check which columns have missing values
4. Check how many entries are available in the dataset. What does each entry mean?
5. Calculate the average and the standard deviation of a form of engagement (e.g., number of likes, comments)
6. Calculate the average, minimum and maximum number of another engagement metric.
7. Indicate the most popular entry (comment, post) in your dataset.
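
The steps above could look roughly like the sketch below. It is a minimal illustration, not a model answer: the toy dataframe and the column names `post_text`, `likes` and `comments` are assumptions made up for this example, and your own dataset will have different columns and needs to be loaded from a file first.

```python
import pandas as pd

# Hypothetical toy data standing in for a real file. With a real dataset you
# would load it instead, e.g. with pd.read_csv("your_file.csv").
df = pd.DataFrame(
    {
        "post_text": ["hello", "good morning", None, "nice photo", "great!", "see you"],
        "likes": [10, 25, 3, 40, 7, 15],
        "comments": [1, 4, 0, 9, 2, None],
    }
)

first_rows = df.head(5)                # 1. first 5 rows
column_names = list(df.columns)        # 2. columns in the dataframe
missing_per_column = df.isna().sum()   # 3. number of missing values per column
n_entries = len(df)                    # 4. number of entries (rows)

likes_mean = df["likes"].mean()        # 5. average of one engagement metric
likes_std = df["likes"].std()          #    and its (sample) standard deviation

# 6. average, minimum and maximum of another engagement metric
comments_summary = df["comments"].agg(["mean", "min", "max"])

# 7. most popular entry, here defined as the row with the most likes
most_popular = df.loc[df["likes"].idxmax()]

print(first_rows)
print(column_names)
print(missing_per_column)
print(n_entries, likes_mean, likes_std)
print(comments_summary)
print(most_popular)
```

Which metric defines "most popular" is a choice you need to make and motivate yourself; likes are used here only as an example.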


### Reflection

Wagner et al. (2021) discuss a set of important challenges for measurement in what they call *algorithmically infused societies*. Select one relevant challenge that they discuss in their text, and relate it to the social media data you have just loaded and reviewed. How are the measures that you have just reported to stakeholders (in the interpretation section) affected by this challenge? Please motivate your response, and be as specific as possible.

## Challenge 2

### Programming challenge

Download your own data from a digital platform in **JSON** or **CSV** format. You can use Facebook or Instagram data (these platforms make the data available almost immediately), or other platforms (e.g., Google, TikTok, Spotify) - but be mindful of how much time the platform says it will take to make the data available to you.

Download the data to your computer, and select one of the files that contains advertising-related data (or profile interests) - but **not** your posts, personal information, or network (friends or followers).

After finding this file, move it to the appropriate folder where you are running the notebook. Load it in Pandas and: 

1. Display the first 5 rows
2. Check which columns the dataframe contains
3. Check which columns have missing values
4. Summarize the information for at least one column (if it is numeric, descriptive statistics should do; if it is categorical or text, counting frequencies and showing the top 5 items is enough).
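
Step 4 depends on the type of the column. The sketch below shows one possible way to handle both cases. The helper `summarize_column`, the toy dataframe and the column names `advertiser_name` and `interactions` are hypothetical and only serve this illustration; your own file will look different and would be loaded with `pd.read_json` or `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for a file from your data download package.
df = pd.DataFrame(
    {
        "advertiser_name": ["A", "B", "A", "C", "A", "B", None],
        "interactions": [1, 0, 3, 2, 5, 1, 0],
    }
)


def summarize_column(frame, column):
    """Return descriptive statistics for numeric columns, top 5 counts otherwise."""
    series = frame[column]
    if pd.api.types.is_numeric_dtype(series):
        return series.describe()
    # value_counts() sorts by frequency and ignores missing values by default
    return series.value_counts().head(5)


numeric_summary = summarize_column(df, "interactions")
categorical_summary = summarize_column(df, "advertiser_name")

print(numeric_summary)
print(categorical_summary)
```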

**Note:** as shown in the videos, sometimes you may need to apply the function ```expand_dictionary``` to get meaningful data. We will cover this in more detail on DA3. If the function does not work, log an issue on GitHub with the problem (if there's still time before the submission) or select a different file in your data download package. 
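
The course function ```expand_dictionary``` is not reproduced here. As a separate, general illustration of why nested data needs expanding, the sketch below uses the pandas function `json_normalize` on made-up records (the keys `name`, `details`, `category` and `score` are assumptions for this example). It may or may not behave like the function shown in the videos.

```python
import pandas as pd

# Hypothetical nested records, similar in shape to what a JSON export may contain.
records = [
    {"name": "Topic A", "details": {"category": "sports", "score": 3}},
    {"name": "Topic B", "details": {"category": "music", "score": 5}},
]

# Loading the records directly keeps the inner dictionary in a single column.
nested = pd.DataFrame(records)

# json_normalize spreads the inner dictionary over separate columns,
# named with a dot between the outer and the inner key.
flat = pd.json_normalize(records)

print(nested.columns.tolist())
print(flat.columns.tolist())
```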

### Reflection

Salz & Dewar (2019) discuss a set of important challenges in their proposed ethics framework. Imagine that you will conduct a research project and ask multiple respondents to donate their platform usage data to you. Please select two challenges from those suggested by the authors, and explain how these challenges relate to doing research using these data donations.

## Challenge 3

### Programming challenge

Social media data and datasets with other types of digital traces are often "messy" - they include missing data, duplicate information or unwanted observations. In the upcoming weeks, you will learn how to deal with such data.

The dataset you are now working with is also an example of such a messy dataset. One of the steps to clean it involves removing unwanted or irrelevant observations. Next week, you will continue working with the social media data and will focus on analysing text. To be able to, for example, classify the posts or comments or analyse their sentiment, you will be asked to focus on entries in one language. To prepare for next week, in this challenge you need to select only the rows that are in one language (for example, rows that have language set to English, i.e., `en`). You can choose any language that is substantially represented in the dataset. Make sure to save this selection. If your dataset does not contain information on the language of the text, you will need to add it yourself.

*Tip*: You can use langdetect to automatically detect the language of a text (see example below).

### Reflection

For challenges 1, 2 and 3, you have done many data science activities, from collecting tweets or your own data to loading, inspecting and cleaning the data. Using the CRISP-DM explanation found in Larose (2014), provide a short overview of the actions you took, and to which step(s) of the CRISP-DM process they belong. Motivate your answer.

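```python
# Example: langdetect

!pip install langdetect

from langdetect import detect

# Assumes the dataframe is called df and the text is in a column named 'post_text'.
# detect raises an error on empty or missing text, so handle those rows first.
df['language'] = df['post_text'].apply(detect)
```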
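
Once a language column exists, selecting and saving the rows for one language could be sketched as follows. The toy dataframe, the column names and the output file name are assumptions for this illustration. The file is written to the system's temporary folder only so that the sketch runs anywhere; in your notebook you would save it next to your data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data; the 'language' column could come from the dataset
# itself or from a language detection step.
df = pd.DataFrame(
    {
        "post_text": ["Good morning", "Goedemorgen", "Nice picture", "Bonjour", "See you soon"],
        "language": ["en", "nl", "en", "fr", "en"],
    }
)

# Check how well each language is represented before choosing one.
language_counts = df["language"].value_counts()
chosen_language = language_counts.idxmax()  # most frequent language in the data

# Keep only the rows in the chosen language; copy() gives an independent dataframe.
selected = df[df["language"] == chosen_language].copy()

# Save the selection so it can be reused later.
output_path = Path(tempfile.gettempdir()) / "posts_selected_language.csv"
selected.to_csv(output_path, index=False)

print(language_counts)
print(selected)
```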


