# Lab 1: Python Review, Intro to Data Science, and Causality & Experiments

Welcome to Test Lab 1!

In this lab, we will go over:

   #### 1) Python Review
   #### 2) Intro to Data Science
   #### 3) Causality & Experiments

# 1. Python Review


## Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

Each of the cells below contains an error. Run each cell, read the error message, and fix the error.

In [11]:
#cout('My name is Billy Bob')

print('My name is Billy Bob')

My name is Billy Bob


In [12]:
#print("My name is Bob Billy"

print('My name is Bob Billy')

My name is Bob Billy


In [13]:
#three = 3 + zero

three = 3 + 0

In [14]:
#3 = [ + 2

three = 1 + 2

In [15]:
#2 + 1 = three

three = 2 + 1

Looking at the various errors, you should be able to correct them based on the message given.

Here is a list of errors you may run into:

   1) **IndexError** is thrown when trying to access an item at an invalid index.
   
   2) **ModuleNotFoundError** is thrown when a module could not be found.
   
   3) **KeyError** is thrown when a key is not found.
   
   4) **ImportError** is thrown when an import fails, for example when a specified name cannot be imported from a module.
   
   5) **StopIteration** is thrown when the next() function goes beyond the iterator items.
   
   6) **TypeError** is thrown when an operation or function is applied to an object of an inappropriate type.
   
   7) **ValueError** is thrown when a function's argument has the right type but an inappropriate value.
   
   8) **NameError** is thrown when a name (such as a variable or function) could not be found.
   
   9) **ZeroDivisionError** is thrown when the second operand (the divisor) in a division is zero.
   
NameError, TypeError, and ValueError tend to be the most common.
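As a minimal sketch of the three most common errors, the cell below triggers each one on purpose and catches it with `try`/`except`, a Python construct that this lab has not introduced. The function name `demo_common_errors` and the name `undefined_name` are made up for illustration.

```python
def demo_common_errors():
    """Trigger three common errors on purpose and return their type names."""
    seen = []
    # Each action deliberately raises one specific exception when called.
    actions = [
        lambda: undefined_name,   # NameError: this name was never defined
        lambda: "3" + 4,          # TypeError: cannot add a string and an int
        lambda: int("three"),     # ValueError: right type (str), bad value
    ]
    for action in actions:
        try:
            action()
        except (NameError, TypeError, ValueError) as err:
            seen.append(type(err).__name__)
            print(type(err).__name__ + ":", err)
    return seen

caught = demo_common_errors()
```

Reading the error's type name first, and then its message, is usually the quickest way to find what went wrong.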
   
   
**Note:** In the toolbar, there is the option to click Cell > Run All, which will run all the code cells in this notebook in order. However, the notebook stops running code cells if it hits an error.


## Data Structures

In data science, the primary data structures you will be using are **lists** and **dictionaries.** Below are some helpful tables to keep in mind.



**Python List Methods**
![Python%20List%20Methods.PNG](attachment:Python%20List%20Methods.PNG)



**Python Dictionary Methods**
![Python%20Dictionary%20Methods.PNG](attachment:Python%20Dictionary%20Methods.PNG)
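
Because the two tables above are images, here is a brief sketch of a few frequently used list and dictionary operations. The variable names and values (`fruits`, `prices`) are hypothetical and are not part of the lab.

```python
# Lists: ordered, changeable sequences.
fruits = ["kiwi", "apple"]
fruits.append("banana")       # add an item to the end
fruits.remove("kiwi")         # remove the first matching item
fruits.sort()                 # sort the list in place
print(fruits)                 # ['apple', 'banana']

# Dictionaries: map keys to values.
prices = {"apple": 0.5, "banana": 0.25}
print(list(prices.keys()))    # ['apple', 'banana']
print(list(prices.values()))  # [0.5, 0.25]
print(prices.get("cherry"))   # None, because the key is missing
print("apple" in prices)      # True
del prices["banana"]          # remove an entry
print(prices)                 # {'apple': 0.5}
```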



**Question 1:** Create a list of all the different types of errors that might occur. Name your list `error_list` and print your output.

In [16]:
error_list = ["indexerror", "modulenotfounderror", "keyerror", "importerror", "stopiteration", "typeerror", "valueerror", "nameerror", "zerodivisionerror"]
print(error_list)

['indexerror', 'modulenotfounderror', 'keyerror', 'importerror', 'stopiteration', 'typeerror', 'valueerror', 'nameerror', 'zerodivisionerror']


**Question 2:** Remove the **most common errors** from the `error_list` that you created in question one. Print your output.

In [18]:
error_list.remove("nameerror")
error_list.remove("typeerror")
error_list.remove("valueerror")
print(error_list)

['indexerror', 'modulenotfounderror', 'keyerror', 'importerror', 'stopiteration', 'zerodivisionerror']


**Question 3:** Create a dictionary called `dict_methods` from the dictionary methods table pictured above. The keys in `dict_methods` should be the method **names**, and the corresponding values should be the **explanations**. Print your output.

In [22]:
dict_methods = {'keys':'returns the keys of the dictionary in a dict_keys object', 'values':'returns the values of the dictionary in a dict_values object', 'items':'return the key-pairs in a dict_items object', 'get':'returns the values associated with k,none otherwise', '[]':'returns the value associated with k, otherwise it\'s an error', 'in':'returns true if key is in the dictionary, false otherwise', 'del':'removes the entry from the dictionary'}
print(dict_methods)

{'keys': 'returns the keys of the dictionary in a dict_keys object', 'values': 'returns the values of the dictionary in a dict_values object', 'items': 'return the key-pairs in a dict_items object', 'get': 'returns the values associated with k,none otherwise', '[]': "returns the value associated with k, otherwise it's an error", 'in': 'returns true if key is in the dictionary, false otherwise', 'del': 'removes the entry from the dictionary'}


**Question 4:** Remove the entry in `dict_methods` that returns boolean values. Print your output.

In [23]:
del dict_methods['in']
print(dict_methods)

{'keys': 'returns the keys of the dictionary in a dict_keys object', 'values': 'returns the values of the dictionary in a dict_values object', 'items': 'return the key-pairs in a dict_items object', 'get': 'returns the values associated with k,none otherwise', '[]': "returns the value associated with k, otherwise it's an error", 'del': 'removes the entry from the dictionary'}


## Defining functions

A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50 (no percent sign).

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Like other names we've defined, it can't start with a number or contain spaces. Let's call our function `to_percentage`:
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  A function can have any number of arguments (including 0!). 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)
    
If we want our function to take more than one argument, we add a comma between each argument name. Note that if we had zero arguments, we'd still place the parentheses () after the name. 

We put a colon after the signature to tell Python it's over. If you're getting a syntax error after defining a function, check to make sure you remembered the colon!

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing an **indented** triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function and every line **must be indented** (by convention with four spaces, which is what pressing Tab inserts in a Jupyter code cell).  Any lines that are *not* indented and are left-aligned with the `def` statement are considered outside the function. 

Some notes about the body of the function:
- We can write code that we would write anywhere else.  
- We use the arguments defined in the function signature. We can do this because we assume that when we call the function, values are already assigned to those arguments.
- We generally avoid referencing variables defined *outside* the function. If you would like to reference variables outside of the function, pass them through as arguments!


Now, let's give a name to the number we multiply a proportion by to get a percentage:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
        
`return` only makes sense in the context of a function, and **can never be used outside of a function**. Python stops executing the body of a function once it hits a `return` statement, so `return` is usually the last line of the function; any code placed after it in the same block never runs.
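
As a small sketch of how `return` ends a function call, the cell below adds an extra `print` line after `return` purely for illustration (it is not part of the lab's function) and then calls the function.

```python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
    return proportion * factor
    print("this line never runs")  # unreachable: it comes after return

half = to_percentage(.5)
print(half)  # 50.0
```

The value of the call is a float, `50.0`, which Python treats as equal to the number 50.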

*Note:*  `return` inside a function tells Python what value the function evaluates to. However, there are other functions, like `print`, that return no useful value (they return `None`). For example, `print` simply prints a certain value out to the console. 

`return` and `print` are **very** different. 
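
To make the difference concrete, here is a minimal sketch with two hypothetical functions (`returns_percentage` and `prints_percentage` are invented names): one returns its result and the other only prints it.

```python
def returns_percentage(proportion):
    """Gives the result back to the caller."""
    return proportion * 100

def prints_percentage(proportion):
    """Only displays the result; nothing is given back."""
    print(proportion * 100)

a = returns_percentage(.25)  # nothing is displayed; a holds the value 25.0
b = prints_percentage(.25)   # displays 25.0, but b is None

print(a)  # 25.0
print(b)  # None
```

Only the returned value can be stored in a name and reused in later calculations.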

**Question 1:** Define a function called `remove_vowels`. It should take a single string as its argument. (You can call that argument whatever you want.) It should return a copy of that string, but without the vowels. In addition, count how many vowels are removed and print the number out.

**Note:** One of the test strings will be the word "supercalifragilisticexpialidocious." 

In [30]:
def remove_vowels(text):
    """Returns a copy of text without vowels and prints how many were removed."""
    newstr = ""
    vowels = 0
    for ch in text:
        # Compare in lowercase so that uppercase vowels are removed as well.
        if ch.lower() in "aeiou":
            vowels += 1
        else:
            newstr += ch
    print('number of vowels: ', vowels)
    return newstr

print(remove_vowels("supercalifragilisticexpialidocious."))

number of vowels:  16
sprclfrglstcxpldcs.


# 2. Intro to Data Science

**Data** is information, especially facts or numbers, usually collected or computed for purposes of analysis. 

**Data Science** is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference.

Types of analysis done in Data Science include:

1) **Descriptive:** describes data, providing insight and knowledge.
    
2) **Predictive:** makes predictions from data.
    
3) **Prescriptive:** makes decisions (prescriptions) based on data. 

Refer to the image below for an example:![Image%20for%20TestLab2.PNG](attachment:Image%20for%20TestLab2.PNG)

# Two types of classifications for data: quantitative and categorical.


## Quantitative

**Quantitative** data takes on a numeric value that can be measured and ordered.

Example: An employee's salary is quantitative data.


### Types of quantitative variables

A **continuous variable's** values are infinite along a continuum of values within a range, typically real numbers. Continuous variables usually represent measurements, like height (0.00104 meters) or temperature (98.6 degrees).

A **discrete variable's** values are finite within a range, typically integers. Discrete variables usually represent countable items, like people in a family (5) or cars in a city (502,434). 

Generally, if "number of" can be added to the beginning, the variable is discrete, like "number of people in a family", but not "number of height". Note: "Discrete" means separate or distinct, not to be confused with "discreet" which means careful or unobtrusive.

## Categorical

**Categorical** data takes on the value (usually a label) of one of several categories. This data is also called **qualitative** data.

Example: The type of occupation someone is in is categorical data.

### Types of categorical variables

Two types of categorical variables are often distinguished:

A **nominal variable's** categories have no ordering, existing in name only, like apples, oranges, and grapes. ("Nominal" means "in name only").

An **ordinal variable's** categories have an ordering, like disagree, neutral, and agree.
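
One way to see the difference in code, as a rough sketch: ordinal categories can be sorted once their order is stated, while nominal categories can only be counted. The survey responses, the `scale` list, and the fruit list below are invented for illustration.

```python
# Hypothetical survey responses (ordinal: the categories have an order).
responses = ["agree", "disagree", "neutral", "agree"]

# The order of the categories is not alphabetical, so we state it explicitly.
scale = ["disagree", "neutral", "agree"]
ordered = sorted(responses, key=scale.index)
print(ordered)  # ['disagree', 'neutral', 'agree', 'agree']

# Hypothetical nominal data: there is no order, so we only count categories.
fruits = ["apple", "grape", "apple", "orange"]
counts = {}
for fruit in fruits:
    counts[fruit] = counts.get(fruit, 0) + 1
print(counts)   # {'apple': 2, 'grape': 1, 'orange': 1}
```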

**Question:** Create a dictionary called `types_of_data` with keys being the **two types of classification for data.** The corresponding values for each key should be a list of the different variable types the key can have. Print your output below.

In [32]:
types_of_data = {'quantitative': ['continuous', 'discrete'], 'categorical': ['nominal', 'ordinal']}
print(types_of_data)

{'quantitative': ['continuous', 'discrete'], 'categorical': ['nominal', 'ordinal']}


# 3. Causality & Experiments

## Job Opportunities & Education in Rural India
A study at UCLA investigated factors that might result in greater attention to the health and education of girls in rural India. One such factor is information about job opportunities for women. The idea is that if people know that educated women can get good jobs, they might take more care of the health and education of girls in their families, as an investment in the girls’ future potential as earners. Without the knowledge of job opportunities, the author hypothesizes that families do not invest in women’s well-being.

The study focused on 160 villages outside the capital of India, all with little access to information about call centers and similar organizations that offer job opportunities to women. In 80 of the villages chosen at random, recruiters visited the village, described the opportunities, recruited women who had some English language proficiency and experience with computers, and provided ongoing support free of charge for three years. In the other 80 villages, no recruiters visited and no other intervention was made.

At the end of the study period, the researchers recorded data about the school attendance and health of the children in the villages.
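
The random selection of 80 out of 160 villages described above can be sketched with Python's `random` module. The village labels and the seed are hypothetical, and this is only an illustration of random assignment, not the researchers' actual procedure.

```python
import random

# Hypothetical village labels; the study description does not name them.
villages = [f"village_{i}" for i in range(160)]

# A fixed seed makes this illustration repeatable.
rng = random.Random(0)
treatment = rng.sample(villages, 80)  # 80 villages chosen at random
treatment_set = set(treatment)
control = [v for v in villages if v not in treatment_set]

print(len(treatment), len(control))   # 80 80
```

Because chance alone decides which villages are visited, the two groups should be similar on average in everything except the recruiting visits.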

**Question 1:** Which statement best describes the treatment and control groups for this study? Assign either 1, 2, or 3 to the name jobs_q1 below.

1) The treatment group was the 80 villages visited by recruiters, and the control group was the other 80 villages with no intervention.

2) The treatment group was the 160 villages selected, and the control group was the rest of the villages outside the capital of India.

3) There is no clear notion of treatment and control group in this study.

In [33]:
jobs_q1 = 1

**Question 2:** Was this an observational study or a randomized controlled experiment? Assign either 1, 2, or 3 to the name jobs_q2 below.

1) This was an observational study.

2) This was a randomized controlled experiment.

3) This was a randomized observational study.

In [34]:
jobs_q2 = 2

**Question 3:** The study reported, “Girls aged 5-15 in villages that received the recruiting services were 3 to 5 percentage points more likely to be in school and experienced an increase in Body Mass Index, reflecting greater nutrition and/or medical care. However, there was no net gain in height. For boys, there was no change in any of these measures.” Why do you think the author points out the lack of change in the boys? Type your answer and store it in `jobs_q3`.

Hint: Remember the original hypothesis. The author believes that educating women in job opportunities will cause families to invest more in the women’s well-being.

In [35]:
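jobs_q3 = "The boys serve as a comparison. If schooling and health had improved for boys as well, the gains could reflect a general improvement in the villages rather than the job information. Seeing a change only for girls supports the hypothesis that families invested more in girls because of the job opportunities for women."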

## Nearsightedness Study

Myopia, or nearsightedness, results from a number of genetic and environmental factors. In 1999, Quinn et al studied the relation between myopia and ambient lighting at night (for example, from nightlights or room lights) during childhood.

**Question 1:** The data were gathered by the following procedure, reported in the study. “Between January and June 1998, parents of children aged 2-16 years [...] that were seen as outpatients in a university pediatric ophthalmology clinic completed a questionnaire on the child’s light exposure both at present and before the age of 2 years.” Was this study observational, or was it a controlled experiment? Explain.

Write your answer in `myopia_q1` below.


In [36]:
myopia_q1 = "This was an observational study. The researchers did not assign children to different lighting conditions; parents simply reported past and present light exposure on a questionnaire."

**Question 2:** The study found that of the children who slept with a room light on before the age of 2, 55% were myopic. Of the children who slept with a night light on before the age of 2, 34% were myopic. Of the children who slept in the dark before the age of 2, 10% were myopic. The study concluded that, "The prevalence of myopia [...] during childhood was strongly associated with ambient light exposure during sleep at night in the first two years after birth."

Do the data support this statement? You may interpret “strongly” in any reasonable qualitative way.

Write your answer in `myopia_q2` below.


In [37]:
myopia_q2 = "Yes, the data support an association: the share of myopic children rises from 10% (dark) to 34% (night light) to 55% (room light). Because the study was observational, this association alone does not show that the light caused the myopia."

**Question 3:** On May 13, 1999, CNN reported the results of this study under the headline, “Night light may lead to nearsightedness.” Does the conclusion of the study claim that night light causes nearsightedness?

Write your answer in `myopia_q3` below.

In [38]:
myopia_q3 = "No. The study's conclusion only claims a strong association between light exposure at night and myopia. The headline suggests causation, which goes further than the observational data can support."

**Question 4:** The final paragraph of the CNN report said that “several eye specialists” had pointed out that the study should have accounted for heredity.

Myopia is passed down from parents to children. Myopic parents are more likely to have myopic children, and may also be more likely to leave lights on habitually (since the parents have poor vision). In what way does the knowledge of this possible genetic link affect how we interpret the data from the study?

Write your answer in `myopia_q4` below.

In [39]:
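myopia_q4 = "Heredity is a possible confounding factor. Myopic parents may both pass myopia to their children and leave lights on more often, so the association between night light and myopia could arise without the light causing myopia."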
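
The heredity concern raised in Question 4 can be illustrated with a small simulation. Everything in it is an assumption made up for this sketch (the number of children and all of the probabilities); none of it comes from the Quinn et al. study. In the simulation, a child's myopia depends only on whether the parent is myopic, never on the light.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 10000               # number of simulated children (made up)

light_total = light_myopic = dark_total = dark_myopic = 0
for _ in range(n):
    parent_myopic = rng.random() < 0.3
    # Assumption: myopic parents leave a light on more often.
    light_on = rng.random() < (0.7 if parent_myopic else 0.2)
    # Assumption: the child's myopia depends ONLY on heredity, not on light.
    child_myopic = rng.random() < (0.5 if parent_myopic else 0.1)
    if light_on:
        light_total += 1
        light_myopic += child_myopic
    else:
        dark_total += 1
        dark_myopic += child_myopic

rate_light = light_myopic / light_total
rate_dark = dark_myopic / dark_total
print(round(rate_light, 2), round(rate_dark, 2))
```

Even though light has no effect in this simulation, the children who slept with a light on are myopic more often, which is the kind of pattern a confounding factor can produce.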

## Studying the Survivors
The Reverend Henry Whitehead was skeptical of John Snow’s conclusion about the Broad Street pump. After the Broad Street cholera epidemic ended, Whitehead set about trying to prove Snow wrong. (The history of the event is detailed here.)

He realized that Snow had focused his analysis almost entirely on those who had died. Whitehead, therefore, investigated the drinking habits of people in the Broad Street area who had not died in the outbreak.

What is the main reason it was important to study this group?

1) If Whitehead had found that many people had drunk water from the Broad Street pump and not caught cholera, that would have been evidence against Snow's hypothesis.

2) Survivors could provide additional information about what else could have caused the cholera, potentially unearthing another cause.

3) Through considering the survivors, Whitehead could have identified a cure for cholera.


**Note:** Whitehead ended up finding further proof that the Broad Street pump played the central role in spreading the disease to the people who lived near it. Eventually, he became one of Snow’s greatest defenders.

In [40]:
survivor_answer = "The first statement is the most important. Studying the people who did not die provides a comparison group: if many of them had also drunk from the Broad Street pump without catching cholera, that would have been evidence against Snow's hypothesis."