# Research Exercise 2: Working with Data in Python

In this homework, we're going to be learning how to use Python to perform some basic computational analysis of a text file and a small set of structured data. 

Some of what we'll be learning will build on [what we learned to do with the command line interface](https://github.com/sceckert/Data-and-Literary-Study-Spring2022/blob/main/_week1/introduction-to-the-command-line.md): performing basic searches across text, identifying patterns, and counting words and lines. Python also allows us to do far more robust analysis: we can write scripts that let us re-use code and manipulate and view data.


## Before we get started... some Python and Jupyter Notebook Tips:

This notebook is a Jupyter Notebook. You can interact with it in a few ways: 

1. You can click on [the Binder version](https://mybinder.org/v2/gh/sceckert/Data-and-Literary-Study-Spring2022/main?urlpath=lab/tree/_week2/research-exercise-2.ipynb) (this is hosted on a cloud server).
2. You can run it on your own machine through JupyterLab:
	- Download this notebook and this folder of data (https://github.com/sceckert/Data-and-Literary-Study-Spring2022/archive/main.zip).
	- I encourage you to learn to create Jupyter notebooks on your own machine, since this will give you a little more control over writing and saving your own Python code.

###  Pro Tips:
- Running a cell in JupyterLab: Click on the cell, then click ► (the "Run" icon) in the menu at the top of this notebook
- `Tab` completion. 
    - Like the command line, Python uses tab completion
    - Pressing the `tab` key on your keyboard will allow you to search for any variables that you've already defined, as well as matching functions or modules within Python.
- Run cells in order!
    - Python executes code in the order that it's written. This means that some parts of code will depend on parts written earlier. If you get an error, it may mean that you simply haven't defined a variable or function. Make sure to run code in the sequence it's written.
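
As a small illustration of why order matters, here is a minimal sketch. The variable name `book_title` and its value are hypothetical and are not used elsewhere in this notebook. The code tries to use a variable before it has been defined, catches the resulting `NameError`, and then defines the variable the way an earlier cell would have.

```python
# Hypothetical example: `book_title` is an illustrative name only.
messages = []

try:
    # `book_title` has not been defined yet, so Python raises a NameError.
    messages.append(book_title)
except NameError as error:
    messages.append(f"NameError: {error}")

# Define the variable, as an earlier cell would have done.
book_title = "Mrs. Dalloway"
messages.append(book_title)

for message in messages:
    print(message)
```

If you see a `NameError` while working through the exercises, scroll up and check whether the cell that defines that variable has been run.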




### SUBMITTING YOUR HOMEWORK:

To submit your homework, save your answers in your Jupyter Notebook.

There are two ways to do this:

- **1. If you are running JupyterLab off Binder (the cloud server interactive version)**:
    - Make sure to SAVE and DOWNLOAD the notebook when you are done. The cloud server will not save *any* of your changes!!
    - To do this, go to the top menu in JupyterLab, click on the Save icon, and then click on the "Download" button. This will download the notebook as an .ipynb file, which is what you'll be turning in as your homework.
![image](images/download.png)
    
    
- **2. If you are running JupyterLab through your own Anaconda Navigator**:
    - Download our course files: https://github.com/sceckert/Data-and-Literary-Study-Spring2022/archive/main.zip and unzip the folder somewhere on your desktop
    - Launch Anaconda Navigator, then launch JupyterLab
    - Once in JupyterLab, navigate to the folder on your desktop with our course materials, then navigate into the directory called "_week2"
    - To save any changes you make to the notebook, click on the Save icon.
    
 
 #### How do I know which way I'm running JupyterLab?
 
 If you're using Binder, you will always have a small Binder icon in the top menu, like this:
 
 ![image](../_images/binder.png)

---

## Part 1: Command Line in Jupyter Notebooks

Jupyter notebooks can be used to type command line commands! This means that all of the command line commands you learned can also be written within a Jupyter notebook environment. 

There are a few important differences. First, all command line commands must be prefaced with an `!`. 

So rather than typing `pwd` or `cat [filepathtomyfile]` (as you would do in your terminal window), you would type `!pwd` or `!cat [filepathtomyfile]`

Let's test it out below!

In [45]:
!ls

Kafka-The-Metamorphosis.txt  python_demo.ipynb
bourdieu-dates.csv           research-exercise-2.ipynb
bourdieu-publishers.csv      walsh_goodreads_classics.csv
introduction-to-python.ipynb woolf-a-room-of-ones-own.txt


In [46]:
# type a command line command here

In [47]:
# type another command line command here

## Part 2: Python Basics

As Part 2 of this homework, please read and complete the [Introduction to Python Basics tutorial](https://github.com/sceckert/Data-and-Literary-Study-Spring2022/blob/main/_week2/introduction-to-python.ipynb) ([interactive version here](https://mybinder.org/v2/gh/sceckert/Data-and-Literary-Study-Spring2022/main?urlpath=lab/tree/_week2/introduction-to-python.ipynb)), which will give you some key foundations to set us up for our exercise today.

**Complete Exercises 1-3 in the Introduction to Python Basics notebook.**

**REMEMBER: if you work in the CLOUD version of the notebook, make sure to download a copy to your local machine. Otherwise, none of the changes you make will be saved.**


## Part 3: Extracting book reviewing data from variables and lists

Complete Questions 1-8 that appear below in this notebook.

The dataset we'll be working with in this set of exercises comes from Melanie Walsh and Maria Antoniak's dataset of the top 144 books that Goodreads users, as of 2019, either tagged as a “classic” the most (according to the number of times the work has been tagged in the site's history) or read the most.

In addition to "classic," the dataset tracks whether a Goodreads user tagged a given title with the tag "most_shelved" or "most_read."


Before going any further, take a minute to read a little about how Walsh and Antoniak constructed this dataset on their ["Goodreads Classics Website"](https://melaniewalsh.github.io/Goodreads-Classics/Goodreads-Classics-Table.html). We're going to be working with a small portion of it for this lesson; in later lessons, we'll be building on these skills as we learn how to interact with a larger dataset.


Here's a peek at part of the dataset: 

In [1]:
import pandas
pandas.read_csv("../_datasets/walsh_goodreads_classics.csv").head(20)

Unnamed: 0,author,title,date,ratings,reviews,average _rating,most_shelved,most_read
0,Harper Lee,To Kill a Mockingbird,1960,5.1M,100k,4.27,2.0,
1,F. Scott Fitzgerald,The Great Gatsby,1926,4.3M,80k,3.93,3.0,8.0
2,George Orwell,1984,1949,3.7M,85k,4.19,5.0,4.0
3,Jane Austen,Pride and Prejudice,1813,3.5M,83k,4.28,1.0,6.0
4,J.R.R. Tolkien,The Hobbit or There and Back Again,1937,3.3M,56k,4.28,29.0,15.0
5,George Orwell,Animal Farm,1945,3.2M,69k,3.97,8.0,2.0
6,Anne Frank,The Diary of a Young Girl,1947,3.2M,33k,4.17,37.0,31.0
7,J.D. Salinger,The Catcher in the Rye,1951,3.0M,66k,3.81,7.0,17.0
8,J.R.R. Tolkien,The Fellowship of the Ring,1954,2.5M,28k,4.37,59.0,35.0
9,William Golding,Lord of the Flies,1954,2.5M,44k,3.69,10.0,


As you complete the exercises below, think about how Walsh and Antoniak are defining "classic." What kind of larger claims could be made with this data? What kind of claims would it be difficult to make with this data?


## Complete Questions 1-8

### Question 1

In [3]:
title1 = 'To Kill a Mockingbird'
title1_author = 'Harper Lee'
title1_date = 1960
title1_average_rating = 4.27
title1_most_shelved = 2
title1_most_read = 0

Write an `if` statement that reports whether `title1_average_rating` is greater than 4.

In [5]:
## Your code here
    print('Title has an average rating greater than 4.')

Title has an average rating greater than 4.


### Question 2
Write an `if` statement that reports whether `title1_date` is 1960

In [13]:
#Your code here
    print('Title was published in 1960.')

Title was published in 1960.


### Question 3
Write an `if` statement that reports whether `title1_date` is less than 2000 *and* `title1_most_shelved` is not zero (i.e., it has been tagged "most_shelved" at least once).

In [14]:
# Your code here
    print('Title was published before 2000 and has been tagged "most_shelved".')

Title was published before 2000 and has been tagged "most_shelved".


### Question 4

In [16]:
title2 = 'The Great Gatsby'
title2_author = 'F. Scott Fitzgerald'
title2_date = 1926
title2_average_rating = 3.93
title2_most_shelved = 3
title2_most_read = 8

Combine an `if` statement with an `else` statement that will report whether `title2_average_rating` is greater than 4 or, if not, less than 4.

In [17]:
# Your code here
    print('Title has an average rating of greater than 4 stars.')
# Your code here
    print('Title has an average rating of less than 4 stars.')

Title has an average rating of less than 4 stars.


### Question 5

In [19]:
title3 = 'The Giver'
title3_author = 'Lois Lowry'
title3_date = 1993
title3_average_rating = 4.13
title3_most_shelved = 87
title3_most_read = ''

Add an `elif` statement that reports whether `title3_date` is exactly 1993.

In [20]:
# Your code here
    print('Title was published before 1993.')
# Your code here
    print('Title was published in 1993.')
# Your code here 
    print('Title was published after 1993.')

Title was published in 1993.


### Question 6

In [52]:
title2_most_read = 8
title3_most_read = ''

Write an `if` statement that will report whether `title2_most_read` indicates that the title has been tagged most read at least once.

In [22]:
# Your code here
    print('Title has been tagged most_read at least once.')

Title has been tagged most_read at least once.


### Question 7
Write a single `if` statement that will accurately report whether both `title2_most_read` and `title3_most_read` indicate that both titles have been tagged "most_read"


>*Hint:*
> Think about how you might use the `!=` operator!  
> And remember that there's a difference between quotation marks with no space `''` and quotation marks with a space `' '`. Python is picky!
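
To see why the two kinds of quotation marks in the hint are not interchangeable, here is a minimal sketch. The variable names `empty_string` and `space_string` are hypothetical and used only for illustration; they are not part of the exercise.

```python
# Hypothetical values, for illustration only.
empty_string = ''    # no characters at all
space_string = ' '   # exactly one character: a space

print(len(empty_string))             # → 0
print(len(space_string))             # → 1
print(empty_string == space_string)  # → False
print(empty_string != space_string)  # → True
```

Because Python compares strings character by character, a comparison written against `' '` will not behave the same way as one written against `''`.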


In [54]:
#Your code Here for `title2_most_read` and `title3_most_read`
    print('Both titles have been tagged most_read at least once.')

### Question 8

In [36]:
title1_reviews = '100k'

Let's say we want to check whether `title1_reviews` is greater than 50,000. Try to test out a sample below:

In [25]:
# your code here
    print('Title has more than 50k reviews.')

TypeError: '>' not supported between instances of 'str' and 'int'

What happened? Why did we get a TypeError? We tried to ask an *integer* question of a value that was formatted as a *string* (Notice the `''`?). 
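
One way to confirm what kind of value you are working with is Python's built-in `type()` function. The sketch below uses hypothetical variable names (`count_as_text`, `count_as_number`) and made-up values, not the exercise data, to show how a string and an integer differ and that comparing them with `>` fails.

```python
# Hypothetical values, for illustration only.
count_as_text = '250'    # a string: note the quotation marks
count_as_number = 250    # an integer: no quotation marks

print(type(count_as_text))     # → <class 'str'>
print(type(count_as_number))   # → <class 'int'>

# Comparing a string with an integer using > raises a TypeError.
try:
    count_as_text > 100
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(comparison_failed)       # → True
```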

So how do we fix this? First we'll have to clean up our data and replace the `k` with `000` (then we'll change the datatype).

You can use the Python keyword `in` to test whether a string appears within another string. Print `title1_reviews` with the `k` replaced with three zeros.

> *Hint:* Remember the string method `.replace()`?

In [37]:
if "k" in title1_reviews:
    title1_reviews = title1_reviews.replace("k", "000") # This line is replacing the k with 000.
    print(title1_reviews)

100000


Great! Now we need to change the datatype using the `int()` function. Run the cell below.

In [42]:
title1_reviews = int(title1_reviews)
print(title1_reviews)

100000

Now, try to check whether `title1_reviews` is greater than 50,000. 

In [43]:
# Your code here:
    print('Title has more than 50k reviews.')

Title has more than 50k reviews.
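
The same cleanup idea could be extended to the `ratings` column in the preview above, which uses an `M` suffix (for example `5.1M`). The sketch below is one possible approach, not part of the assignment: the function name `to_number` is hypothetical, and it assumes that `k` means thousands and `M` means millions. It uses a function definition, which goes slightly beyond what these exercises cover.

```python
def to_number(text):
    """Convert a count like '100k' or '5.1M' into an integer.

    Hypothetical helper: assumes 'k' means thousands and 'M' means millions.
    """
    if "k" in text:
        # Remove the suffix, then scale the remaining number by 1,000.
        return round(float(text.replace("k", "")) * 1_000)
    if "M" in text:
        # Remove the suffix, then scale the remaining number by 1,000,000.
        return round(float(text.replace("M", "")) * 1_000_000)
    # No suffix: the text is assumed to already be a plain whole number.
    return int(text)


print(to_number("100k"))
print(to_number("5.1M"))
print(to_number("1234"))
```

Scaling with `float()` and `round()` rather than replacing `M` with six zeros is a deliberate choice here: a value like `5.1M` contains a decimal point, so simply appending zeros would not produce a valid integer.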


### Question 9: Reflection
In a few sentences, reflect on the simple methods that you've learned and the dataset that you've begun to explore.

1. What research questions would you want to ask with these? 
2. What gets left out when we're looking at the Goodreads data in table form? (Another way to ask this question: would you know these reviews or ratings came from Goodreads if you just encountered the table of data without the introduction?) Are there any aspects of the data (or its origin) that we might want to think about?
3. In the specific case of the Goodreads data, are there other questions that you might want to ask with this data? What other questions or patterns might you want to investigate in the larger dataset? 

**Double-click this cell to type your thoughts here**