
# Data Wrangling with R - Exercise 2

Today's goal is to practice some of the basics of working with data using R's [tidyverse](https://www.tidyverse.org/) environment.

To submit this assignment, please:
- make a copy of this notebook
- share it with me (mzettersten@ucsd.edu)
- submit a share link of the completed assignment on Canvas

For 0.5 extra credit, complete the following bonus questions:
- 1.7
- 2.4



## 0. Setting Up: Loading the tidyverse packages
Before working through this notebook, make sure that your runtime is set to use R (rather than Python). This should be set already, but you can double-check by selecting the following from the menu at the top of the notebook.

  Runtime > Change runtime type

Then make sure that R appears in the drop-down menu under runtime type.

This first part of the code just installs and loads the R packages we want to use for the following analyses.

In [None]:
#Install packages we want to use

#this loads a helper function for installing R packages
source('https://raw.githubusercontent.com/COGS119/tutorials/refs/heads/main/R/load_install_packages.R')

#specify the names of all packages we want to use here
packages_to_apt_install = c('tidyverse')

#install the packages specified above into our Google Colab environment
#(note that we're using our old friend the for loop, this time in R syntax)
#For more: https://www.w3schools.com/r/r_for_loop.asp
for (package in packages_to_apt_install) load_install_package(package, apt=TRUE)

#load packages we want to use
library(tidyverse)

#set some basic plotting defaults
theme_set(theme_minimal(base_size = 18))

#check the version of R used
print(R.version.string)


## 1. Practice working with Experiment Data

Let's pick up where we left off last time, working with a (cleaned-up) version of the type of data you'll be working with in your experiment. [NOTE: Most of Section 1 is designed to be very similar to [Data Wrangling Exercise 1, Section 4](https://colab.research.google.com/drive/1ndgXH2X3WUkHbnuH3MM5cerWx9_2Hj-n?usp=sharing) - feel free to consult the code there frequently!]

This data is from [Zettersten & Lupyan (2020)](https://drive.google.com/file/d/1kpNFFcPdA9XiL4jseJtnrE3gdUOWkg66/view?usp=drive_link) (one of the experiments not chosen for a project).

Reminder: In the experiment, participants figure out which of two categories (A or B) a given image belongs to through trial and error. Over time (24 trials, organized into three blocks of 8 trials each), they start to learn the category structure. The key manipulation is a between-subjects condition: in the high nameability condition, the features of the images (in this case, colors) are very easy to name (e.g., blue, orange). In the low nameability condition, the colors are very difficult to name (e.g., sienna, mauve). The finding is that categories were easier to learn when the color features were more nameable.

The dataset here contains the **actual** data from the first two experiments in the paper (1A and 1B). The experiments are structurally identical; they just use slightly different arrangements of colors.

### 1.1. Read in the data

In [None]:
#read in the data we'll use
#data from: Zettersten & Lupyan (2020)
#https://drive.google.com/file/d/1kpNFFcPdA9XiL4jseJtnrE3gdUOWkg66/view?usp=drive_link
nameability_data <- read_csv("https://raw.githubusercontent.com/COGS119/tutorials/refs/heads/main/R/nameability_exercise.csv")

### 1.2. Understand the structure of the data

First, let's understand how our data is structured. Here's another way to take a first peek at your data.

In [None]:
nameability_data |> View()

The data contains the following columns:

* **experiment**: the name of the experiment in the paper (1A or 1B)
* **subject**: the unique participant id
* **condition**: nameability condition (high or low)
* **total_trial_num**: the trial number (1-24)
* **block_num**: the block number (3 blocks of 8 trials each)
* **image_name**: the name of the stimulus to be categorized
* **is_right**: whether the response was correct (is_right=1) or incorrect (is_right=0)

Is the data in tidy format? Why or why not?

### 1.3. Summarize the data from Experiment 1B

Let's **summarize the average accuracy for each participant from Experiment 1B only**, retaining information about which condition they were in.

To do so, we need to:
* filter the data to Experiment 1B
* group the data by condition (why?) and participant
* summarize the average accuracy
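
As a rough sketch of how these three steps chain together, here is the same filter, group, summarize pattern on a small made-up data frame. The names `toy_data`, `study`, `participant`, `group`, and `correct` are invented for illustration and are not columns of `nameability_data`.

```r
library(dplyr)

# toy data, made up for illustration: 2 studies, 3 participants, 2 trials each
toy_data <- tibble(
  study = c("S1", "S1", "S1", "S1", "S2", "S2"),
  participant = c("p1", "p1", "p2", "p2", "p3", "p3"),
  group = c("a", "a", "b", "b", "a", "a"),
  correct = c(1, 0, 1, 1, 0, 0)
)

toy_summary <- toy_data |>
  filter(study == "S1") |>            # keep only the rows from one study
  group_by(group, participant) |>     # one output row per group/participant pair
  summarize(mean_correct = mean(correct), .groups = "drop")

toy_summary
```

Take a look at which columns show up in `toy_summary`: only the grouping variables and the new summary column remain.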

In [None]:
#create a data frame summarizing the average accuracy for each participant for Experiment 1B


### 1.4. Plot the data

Create a plot to show the distribution of participants' accuracy in each condition. This could take several forms (a histogram, jittered points, a dotplot, ...); it's entirely up to you!

In [None]:
#create a plot to show the distribution of participants' average accuracy in each condition

### 1.5. Summarize accuracy by image (instead of participant)

Let's summarize the data from Experiment 1B a different way.

Instead of summarizing by participant, let's **summarize the average accuracy for each IMAGE** from Experiment 1B, retaining information about which condition each image appeared in.

To do so, we need to:
* filter the data to Experiment 1B
* group the data by condition and image
* summarize the average accuracy

In [None]:
#create a data frame summarizing the average accuracy for each IMAGE for Experiment 1B

Now, create a plot showing the distribution of image accuracies.

**In a comment in your code, please answer the following questions:**

- What is the image with the lowest overall accuracy in Experiment 1B?
- What is the image with the highest overall accuracy in Experiment 1B?

In [None]:
#create a plot showing the distribution of image accuracies

# What is the image with the lowest overall accuracy in Experiment 1B?

# What is the image with the highest overall accuracy in Experiment 1B?

### 1.6. Summarize the entire data

For the next section, let's put together everything we've seen so far by summarizing how **participants in Experiments 1A and 1B** performed **in each of the three blocks** of the experiment.

Specifically, let's:
- compute the by-block average accuracy for each participant, i.e. average accuracy grouped by experiment, participant, condition, and experiment block
- for each experiment, summarize accuracy for the high and low nameability condition across participants for each block. Also compute the standard error of the mean for each block:

$SEM = \frac{SD}{\sqrt{N}}$
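
Here is a minimal sketch of how the SEM formula can be written inside `summarize()`, using made-up numbers. The data frame `toy_accuracy` and its columns `block` and `accuracy` are assumptions for illustration only; the grouping you need for the assignment has more variables.

```r
library(dplyr)

# made-up by-participant accuracies for two blocks (illustration only)
toy_accuracy <- tibble(
  block = c(1, 1, 1, 1, 2, 2, 2, 2),
  accuracy = c(0.50, 0.75, 0.50, 0.75, 0.75, 1.00, 0.75, 1.00)
)

toy_sem <- toy_accuracy |>
  group_by(block) |>
  summarize(
    n = n(),                        # N: number of values in each group
    mean_accuracy = mean(accuracy),
    sd_accuracy = sd(accuracy),     # SD: standard deviation in each group
    sem = sd_accuracy / sqrt(n)     # SEM = SD / sqrt(N)
  )

toy_sem
```

Inside a single `summarize()` call, later expressions can refer to columns created earlier in the same call, which is why `sem` can reuse `sd_accuracy` and `n`.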

In [None]:
# create a data frame that summarizes participants' average accuracy
# for each experiment AND each block of the experiment (make sure to retain their condition)


In [None]:
#compute the overall group mean by experiment, condition, and block
#also compute the standard error of the mean


### 1.7. [BONUS] Create a plot!

Create a plot showing how learning unfolds across blocks for each experiment.

**Tip:** You can use [`facet_wrap()`](https://ggplot2.tidyverse.org/reference/facet_wrap.html) to create separate panels for a grouping variable (like experiment!) within the same plot.
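
To give a feel for what `facet_wrap()` does, here is a small sketch on invented data. The data frame `toy_blocks` and its columns `panel`, `group`, `block`, and `value` are hypothetical stand-ins, not the assignment's variables.

```r
library(ggplot2)

# invented data: 2 panels x 2 groups x 3 blocks (illustration only)
toy_blocks <- data.frame(
  panel = rep(c("X", "Y"), each = 6),
  group = rep(rep(c("g1", "g2"), each = 3), times = 2),
  block = rep(1:3, times = 4),
  value = c(0.50, 0.60, 0.70, 0.50, 0.55, 0.60,
            0.50, 0.70, 0.90, 0.50, 0.60, 0.65)
)

toy_plot <- ggplot(toy_blocks, aes(x = block, y = value, color = group)) +
  geom_line() +
  geom_point() +
  facet_wrap(~panel)   # one panel per value of `panel`
```

Evaluating `toy_plot` in a cell displays the figure, with one panel for each value of `panel` and one colored line per `group` within each panel.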

In [None]:
#create a plot showing how average accuracy changes across blocks for each condition in Experiment 1A and in Experiment 1B (separately)

## 2. Wrangling Experiment Data: Typical Critters

Next, we'll work with a second dataset to get more comfortable working with experiment data. This time, we'll look at a dataset that retains more of its original jsPsych columns. This will help us learn strategies for cleaning up and organizing jsPsych data.

In this experiment, participants judged how typical each image in a set was for one of four animal categories: birds, cats, dogs, and fish. Each participant judged 16-20 images for 2 categories (e.g., birds and dogs).


### 2.0. Load the data

First, let's load the data and take a closer look. We're storing the data frame in the object `typical_data_all`.

In case you're curious, you can also view all of the individual images here:
[https://github.com/COGS119/tutorials/tree/main/R/typicality_images](https://github.com/COGS119/tutorials/tree/main/R/typicality_images)

Here's an example of a [typical dog](https://github.com/COGS119/tutorials/blob/main/R/typicality_images/beagle_600x600.jpg).

Here's an example of an [atypical dog](https://github.com/COGS119/tutorials/blob/main/R/typicality_images/afghanhound_600x600.jpg).

In [None]:
#read in the data we'll use
typical_data_all <- read_csv("https://raw.githubusercontent.com/COGS119/tutorials/refs/heads/main/R/typicality_exercise.csv")

Let's take a look at the dataset. There are a lot of columns! The dataset is still in a relatively "raw" form, with many of the columns that jsPsych adds to our data by default. This can be very helpful for keeping as much information as possible about what happened during our experiment. BUT, it can also be distracting once we go to analyze our dataset.

In [None]:
#inspect the data - lotta columns!
typical_data_all |>
  View()

Before we move on, let's use two verbs (one old, one new) to clean up the dataset a bit:
- [`filter()`](https://dplyr.tidyverse.org/reference/filter.html): `filter()` can help us keep only the rows we want to analyze further in our dataset. This is helpful for removing jsPsych trial types that we don't need for our analysis (e.g., instruction trials).
- [`select()`](https://): `select()` allows us to keep only the columns that we want to analyze further. This is helpful for removing jsPsych columns that we don't need in our current analysis (e.g., `internal_node_id`).

Here, let's use `filter()` to select only the typicality rating trials (removing instruction trials) and `select()` to keep only the key columns for our analysis. We'll store it in a new data frame, `typical_data`.



In [None]:
#filter and select the data we want to keep
typical_data <- typical_data_all |>
  filter(trial_type == "survey-likert") |>
  select(subject, trial_number, category, typicality_condition, image_name, nameability, typicality_rating)

#take a look at the new data frame
typical_data |>
  View()

We now have a data frame with 7 columns:
- **subject:** the unique participant id
- **trial_number:** the trial number for each subject
- **category:** the category the image belongs to (bird, dog, cat, fish)
- **typicality_condition:** the typicality condition for each image (atypical vs. typical), specified by the experimenter. The experimenter selected some images as being more likely to be seen as typical and some as being more likely to be seen as atypical.
- **image_name:** the name of the specific image that participant saw on each trial. See [here](https://github.com/COGS119/tutorials/tree/main/R/typicality_images) for the specific images.
- **nameability:** a specific property of the image, its "nameability" (see below for more details). NOTE: each image has its own nameability value (identical across participants), based on data collected from a *different* experiment.
- **typicality_rating:** participants' rating of the image's typicality, on a scale of 1 (very atypical) to 5 (very typical).

### 2.1. Summarize typicality for each image

Now that we have the data in a more convenient shape, let's create a new **data frame that summarizes the average typicality for each image**.

- in addition to `image_name`, retain information about `category`, `typicality_condition`, and `nameability` by including them as grouping variables
- because every participant judges a given image just once, we **do not** need to group by participant.
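
The sketch below illustrates the idea of carrying extra columns through a summary by adding them to `group_by()`. The data frame `toy_ratings` and its columns `rater`, `item`, `item_type`, and `rating` are invented for illustration. It assumes, as in our dataset, that the extra column has a single value per item.

```r
library(dplyr)

# invented ratings: 2 raters each rate 2 items once (illustration only)
toy_ratings <- tibble(
  rater = c("r1", "r2", "r1", "r2"),
  item = c("item_a", "item_a", "item_b", "item_b"),
  item_type = c("round", "round", "square", "square"),
  rating = c(4, 5, 2, 1)
)

toy_item_summary <- toy_ratings |>
  group_by(item, item_type) |>   # item_type is constant within each item
  summarize(
    mean_rating = mean(rating),
    n_ratings = n(),
    .groups = "drop"
  )

toy_item_summary
```

Because `item_type` never varies within an item, adding it to the grouping does not create any additional groups: the summary still has one row per item, and the extra column simply comes along for the ride.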



In [None]:
#create a data frame that summarizes the average typicality for each image
#keep information about category, typicality_condition, and nameability


### 2.2. Visualize the data

Let's focus in on **just the dog images** (by filtering the data to just the dog category).

Create a bar plot showing the average typicality ratings for all dog images. Tip: use [`geom_bar(stat="identity")`](https://ggplot2.tidyverse.org/reference/geom_bar.html).
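
For reference, here is a small sketch of a bar plot built from values that have already been summarized, using invented numbers. The data frame `toy_item_means` and its columns `item` and `mean_rating` are hypothetical.

```r
library(ggplot2)

# invented per-item means (illustration only)
toy_item_means <- data.frame(
  item = c("item_a", "item_b", "item_c"),
  mean_rating = c(4.5, 1.5, 3.0)
)

toy_bar_plot <- ggplot(
  toy_item_means,
  aes(x = reorder(item, mean_rating), y = mean_rating)  # sort bars by height
) +
  geom_bar(stat = "identity") +   # bar height = the value in the data
  labs(x = "item", y = "mean rating")
```

By default, `geom_bar()` counts rows; `stat = "identity"` tells it to use the y values as they are. `geom_col()` is a shorthand that does the same thing. Sorting the bars with `reorder()` is optional, but it can make the highest and lowest bars easier to spot.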

Answer the following questions:

- Which dog image has the highest typicality rating on average?
- Which dog image has the lowest typicality rating on average?

In [None]:
# create a bar plot showing the average typicality ratings for all dog images.

# Which dog image has the highest typicality rating on average?

# Which dog image has the lowest typicality rating on average?

### 2.3. Investigate typicality by condition

Now, let's switch gears and check whether our expectations about which images would be more or less typical (`typicality_condition`) match how people rated our images.

For this part, create a data frame summarizing the average typicality ratings for each condition, so:
- group by typicality condition (only)
- summarize average typicality

What are the average typicality ratings for each condition? Do images between the two conditions differ in typicality ratings the way we would expect?

In [None]:
# create a data frame that summarizes average typicality by typicality_condition

Summarize average typicality by condition again, but this time let's break down average condition ratings by category (bird, cat, dog, fish). [How can we accomplish this?]

In the code, please also answer the following question:
Which of the four animal categories has the largest difference between typical and atypical images? (It's fine to read this off the data frame you computed, but if you can figure out how to compute this in R, bonus!!)
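
If you would like to try computing such a difference in R, one possible approach (a sketch, not the only way) is to spread the two condition means into separate columns with `pivot_wider()` from tidyr and then subtract them. The data frame `toy_condition_means` and its columns `kind`, `level`, and `mean_rating` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# made-up summary: one mean per kind and level (illustration only)
toy_condition_means <- tibble(
  kind = c("k1", "k1", "k2", "k2"),
  level = c("low", "high", "low", "high"),
  mean_rating = c(2.0, 4.5, 3.0, 3.5)
)

toy_differences <- toy_condition_means |>
  # one row per kind, with one column per level ("low" and "high")
  pivot_wider(names_from = level, values_from = mean_rating) |>
  mutate(difference = high - low) |>
  arrange(desc(difference))        # largest difference first

toy_differences
```

The new column names (`high` and `low` here) come from the values of the column passed to `names_from`.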

In [None]:
# create a data frame that summarizes average typicality by category and typicality_condition

# Which of the four categories has the largest difference in average rating between typical and atypical conditions?

### 2.4. [BONUS] Relationship between nameability and typicality.

Note: Complete this next section (together with 1.7) to earn 0.5 extra credit on the assignment!

For each image, we also have information about its "nameability": how much people agree on what to call the image. For example, if everyone names an image in the same way (e.g., everyone calls the image "dog"), then its nameability will be high (close to 1). If everyone names the image in a different way (e.g., one person calls it a "heron", another person calls it a "crane", another person calls it a "bird", etc.), then its nameability will be low (close to 0).

Using the data frame you created in 2.1., let's look at the relationship between nameability and average typicality rating for each image. Can you create a plot that:
- shows points representing nameability (x-axis) and average typicality (y-axis)
- includes a linear fit showing the relationship (correlation) between nameability and typicality

**Tip:** A quick way to create a linear fit using ggplot is to use [`geom_smooth(method="lm")`](https://ggplot2.tidyverse.org/reference/geom_smooth.html).



In [None]:
# create a plot showing the linear relationship between nameability and average typicality

Next, can you create a plot showing the same linear relationship between nameability and typicality, but this time, show the linear fit separately for the typical and atypical condition?

**Tip:** Add a color variable to the `aes()` call, i.e., `aes(..., color = typicality_condition)`, to group the data into two different colors. This will also automatically separate the data for `geom_smooth()`.
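
Here is a hedged sketch of how a color aesthetic splits both the points and the linear fits, using invented data. The data frame `toy_points` and its columns `x`, `y`, and `group` are hypothetical placeholders.

```r
library(ggplot2)

# invented data: two groups with opposite trends (illustration only)
toy_points <- data.frame(
  x = c(0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.4, 0.6, 0.8, 1.0),
  y = c(1.0, 1.5, 2.2, 2.4, 3.1, 4.0, 3.6, 3.1, 2.9, 2.2),
  group = rep(c("g1", "g2"), each = 5)
)

toy_fit_plot <- ggplot(toy_points, aes(x = x, y = y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm")   # one linear fit per color group
```

Because `color = group` is set in the top-level `aes()`, both `geom_point()` and `geom_smooth()` inherit it, so each group gets its own fitted line. Removing `color = group` would give a single fit across all points.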

In [None]:
# create a plot showing the linear relationship between nameability and average typicality, for each condition