# DS HELPER TEAM - R basics

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
library(dplyr)

In [None]:
%%R

movies = readr::read_csv("https://raw.githubusercontent.com/almenal/imdb_data/main/movies.csv")
movies

In [None]:
%%R 
head(movies$votes)

In [None]:
%%R
head(movies$name)

There's the data! Feel free to explore it.

As a starting tip, what you have above is a "data frame" (or tibble), which is a fancier term for "Excel table". It has 6,820 rows and 15 columns, each of a specific data type: numeric (or `dbl`), character (or `chr`), etc.

Each row of the dataframe is an observation, or data point. Each observation has values from different variables, which correspond to the columns. For example, the first row shows information on the movie "Stand by Me", like its budget, its director, the year it was produced, etc.
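
To make the rows-and-columns idea concrete, here is a minimal sketch on a tiny hand-made table. The object `toy_movies` and all of its values are made up for illustration; they are not taken from the real dataset.

```r
library(dplyr)  # dplyr makes tibble() available

# A tiny, made-up table (NOT the real dataset): 3 observations, 3 variables
toy_movies = tibble(
  name    = c("Movie A", "Movie B", "Movie C"),
  year    = c(1986, 1994, 2001),
  runtime = c(89, 142, 120)
)

nrow(toy_movies)     # number of rows (observations)
ncol(toy_movies)     # number of columns (variables)
toy_movies[1, ]      # the first row: every variable for one observation
toy_movies$runtime   # one column: the same variable for every observation
```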

In [None]:
%%R
# colnames() returns the names of the columns of the dataframe.
# You can use it to get a better overview of what kind of data the dataframe holds.
colnames(movies)

In [None]:
%%R
# Manipulating data:
# If you want to extract an entire column from a dataframe, you can do so by 
# using the $ notation: specify the column name after the name of the dataframe and a $ symbol

countries = movies$country

# head() gives a brief look at the data. Often we don't need every detail,
# just a general idea or a few examples of the data we have
head(countries)


In [None]:
%%R
# How would you extract the information on what year the movies were released?

#### Your code here ####


## Situation
You have been working tirelessly for the poster presentation and are finally done. You did so well that you want to reward yourself with a nice movie night. **BUT**, you have been working sooo hard that you want to choose the **best** movie possible.

And for that, you will use the power of Data Science!

We have given you a dataset with more than 6,000 movies to choose from. For sure, the ideal movie will be there.


## Step 0: Too many variables!

As you have seen before when we used `colnames()`, the dataset has a lot of information on each movie, perhaps even too much. We can start our journey by focusing on the variables that we really care about. For example, the `revenue` of the movie, or its `budget`, might not help a lot in choosing the movie, so we can leave them out of the dataset for now.

Use the function `select()` to tell R to keep only certain columns from a data frame. The way you use the function (the syntax) is to first provide a data frame, and then a vector with the variable names to keep:

`reduced_dataframe = select(dataframe, c(var1, var2, var3))`
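
As a small sketch of that syntax, the example below applies `select()` to a made-up table. The object names (`toy_movies`, `toy_narrow`) and the values are hypothetical and only serve to show which columns survive.

```r
library(dplyr)

# Made-up data for illustration only
toy_movies = tibble(
  name   = c("Movie A", "Movie B", "Movie C"),
  budget = c(8, 25, 60),
  score  = c(8.1, 9.3, 6.5)
)

# Keep only the name and score columns; budget is dropped
toy_narrow = select(toy_movies, c(name, score))
colnames(toy_narrow)
```

The number of rows does not change: `select()` only affects columns.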

In [None]:
%%R

narrower_df  = select(movies, c(country, director,
                                # ... other variables here
                                ))


## Step 1: Who are the only ones?
The first thing that you might want to do is see what movie genres there are. That information is located in the `genre` column of the dataframe, but you might want to summarise it to make your life easier. For instance, you don't need to look at all 6,820 rows to see the possibilities; you only need to see each possibility once.
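
The idea of collapsing repeated labels can be sketched with the base R function `unique()` on a made-up vector. The vector `toy_colours` is hypothetical and unrelated to the movie data.

```r
# Made-up vector with repeated labels (illustration only)
toy_colours = c("red", "blue", "red", "green", "blue", "red")

unique(toy_colours)          # each label once, in order of first appearance
length(unique(toy_colours))  # how many different labels there are
```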


In [None]:
%%R
# By using the unique() function, we can see what the unique values of a character vector are, without repetitions


my_genre = 

You can use the same function again to see the countries where the movies were produced. Do you want to go for a Hollywood movie? Or keep it British? Or perhaps go somewhere completely different? You decide!

In [None]:
%%R
# Your code here


my_country = 

## Step 2: Hide and SEEK!

Now that you have narrowed down the movies to a specific genre and country, you can `filter` the dataframe to contain only the movies from the country and of the genre you have selected.


In [None]:
%%R
# The filter function needs two inputs: the dataframe to filter and the condition used to filter

my_genre_movies = filter(movies, genre == my_genre) 
# Note that to check for equality we need to use two equals signs: "=="

In [None]:
%%R
# Your turn to code.
# Now filter by country as well. (If you don't care where the movie was produced, just pretend you do for now. After all, this is a learning exercise.)

## Tip: my_genre_movies already contains only your chosen genre, so you can filter that instead of movies





## Step 3: Still too many?

Apart from categorical variables (like `country` and `genre`), we can apply other filtering criteria to numerical variables. For example, I get sleepy during movie night, so I cannot watch anything longer than two and a half hours. If that's your case as well, you can `filter()` out the long movies by using the `<`, `<=`, `>` and `>=` operators. The syntax is the same as above!
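
Here is a minimal sketch of a numerical filter on a made-up table. The names `toy_movies` and `popular_movies`, the threshold, and the values are all hypothetical; it uses a vote count so the runtime exercise is left for you.

```r
library(dplyr)

# Made-up data for illustration only
toy_movies = tibble(
  name  = c("Movie A", "Movie B", "Movie C", "Movie D"),
  votes = c(500, 12000, 3000, 1000)
)

# Keep only the rows with at least 1000 votes
popular_movies = filter(toy_movies, votes >= 1000)
popular_movies$name
```

Note that `>=` keeps the row that sits exactly on the threshold ("Movie D"), while `>` would drop it.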

In [None]:
%%R

my_runtime = # NOTE: The runtime variable gives the duration of the films in MINUTES!

## Your code here ##

## Step 4: Who's the best?

Alright! So far we have narrowed down our search to only a few movies, but somehow there are still too many to look up manually! You have a refined taste in movies, so to make your life easier, you can start by looking at the movies with the highest score.

Luckily, `R` has your back. Use the `arrange()` function to sort a dataframe according to the values in a column. 

`sorted_df = arrange(df, variable)`

The default behaviour is to sort them in ascending order, which is the opposite of what we want. To fix that, we have to add `desc()` around the variable that we specify:

`sorted_df = arrange(df, desc(variable))`
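
The difference between the two calls can be sketched on a made-up table. The object names and values below are hypothetical, and the example sorts by a vote count rather than the score.

```r
library(dplyr)

# Made-up data for illustration only
toy_movies = tibble(
  name  = c("Movie A", "Movie B", "Movie C"),
  votes = c(500, 12000, 3000)
)

ascending  = arrange(toy_movies, votes)        # smallest value first
descending = arrange(toy_movies, desc(votes))  # largest value first

ascending$name
descending$name
```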

Now it's your turn to sort the dataframe according to score!

In [None]:
%%R

## Your code here ##

In [None]:
%%R
## BONUS ---
# The function slice() selects rows of a dataframe by position.
# Here, head(df, n) would return the same first n rows,
# but slice() can pick any set of row positions, not just the first ones.
# Either result can be saved into an object to use later.

slice(movies, 1:4)

## Congratulations!

You have successfully used R's data wrangling package `dplyr` to navigate a big dataset and arrange a perfect movie night. Sadly, there's no R implementation that would help you make popcorn... yet! Still, R helped.

Let's go over what we have learned so far:

- Data frames are the most popular way of representing tabular data in R
- The `dplyr` package is a useful tool to handle data frames
- With `select()` you can tell R which variables you want to keep
- `unique()` gives you the unique values of a character vector (useful to know the different labels of a categorical variable)
- `filter()` helps you keep only rows that meet a condition
- `arrange()` sorts a dataframe according to a variable

## Final exercise: putting it all together... with for loops!

It is common in any programming language to do the same task many times while changing one parameter each time. For example, as a wet-lab scientist you might perform a t-test to see if the growth of cancer cells under different treatments differs from the control. You could write the code for the t-test many times (once for each treatment), or you can write the code once and tell R to change the treatment group each time.

The idea of a for loop is to avoid writing the same code multiple times and instead just state what you want to do, and what parts you want to change in each round. This is helpful in two ways: first, it makes your code more concise and easier to read; and second, there is a high chance that by copy-pasting the same code and changing one or two variables you will make some mistake along the way.

### Syntax

To use a for loop in R we need 3 ingredients:

1. The parameter that we will change every time
2. The code we want to repeat
3. Some place to store the results

And the syntax is:

```r
parameters_vector = c(5,10,15,20)
results_vector = c()
for (i in 1:length(parameters_vector)){
  variable = parameters_vector[i]
  step1 = do_stuff(variable)
  # ...
  stepN = do_more_stuff(step1)
  results_vector[i] = stepN
}
```

1. The vector `parameters_vector` specifies the different parameters we want to apply to our code.
1. The next line of code creates an empty vector (`results_vector`) that we will use to store the results generated in the for loop. 
1. In the first "round" of the for loop (a.k.a. the first **iteration**), the variable `i` will take the value 1. We use this as an index to access the first element in `parameters_vector` (in this case, the number 5), which will be stored as `variable` and given as a parameter to the function `do_stuff()`. 
1. The result of this is stored in `step1`, which can be used in other functions to produce more results, etc. 
1. After a series of steps, we want to store the last result produced (`stepN`) somewhere, and for that we use the empty vector. Recall that square brackets are used for indexing a vector, so we are telling R that the $i^{th}$ element of `results_vector` is the result produced in the $i^{th}$ iteration of the for loop.
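
The template above uses placeholder functions, so it cannot be run as written. As a runnable sketch of the same three ingredients, the example below computes the area of a few circles. The names `radii` and `areas` and the input values are made up for illustration.

```r
# Made-up input values, for illustration only
radii = c(1, 2, 3)   # 1. the parameter that changes every time
areas = c()          # 3. an empty vector to collect the results

for (i in 1:length(radii)) {
  r = radii[i]       # pick the parameter for this iteration
  area = pi * r^2    # 2. the code we want to repeat
  areas[i] = area    # store the result in position i
}

areas
```

If the input vector might be empty, `seq_along(radii)` is a safer choice than `1:length(radii)`, because `1:0` counts down through 1 and 0 instead of skipping the loop.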

Now that you know this, let's put it to practice.

### Exercise
You already have your movie night set up, but there's still some time left, so you **obviously** decide to spend your free time exploring the data with R :D

Specifically, you're curious about the number of films in each age rating category. You can put into practice the functions we have learned so far!

In [None]:
%%R
movies = readr::read_csv("https://raw.githubusercontent.com/almenal/imdb_data/main/movies.csv")

# 0. select() to select only age rating column
reduced_df = # ...

# 1. use unique() to get unique values of age rating
unique_age_ratings = # ...

# 2. You will need to store the results of your analysis somewhere, 
# and for that we first need to create an empty vector
movies_per_age = c()

# for loop:
for (j in 1:length(unique_age_ratings)){
    # Use j to index the j_th_ value of unique_age_ratings

    # Filter the reduced dataframe to include only
    # movies that match the age rating of this iteration

    # Get the number of rows of the dataframe with nrow()

    # Store the number of rows into the j_th_ element of the results vector
}

Now you can either "print" the information that you have gathered, or represent it graphically as a barplot!

In [None]:
%%R
# Give the vector names to put labels in the barplot 
names(movies_per_age) = unique_age_ratings
barplot(movies_per_age)

### Further reading

There are still a lot of useful functions that the `dplyr` package can offer. [You can check out its webpage](https://dplyr.tidyverse.org/) for further info; it's pretty beginner-friendly.

Also, for general reference there's the fantastic book [R for Data Science](https://r4ds.had.co.nz).

# Next session... Statistical tests, fake or fact!