# <span style="color:darkblue" fontsize = 500> Lecture 1: Introduction to Jupyter Notebooks </span>

<font size="5">

The basic structure for running Python for data projects
<img src="figures/project_flow.png" alt="drawing" width="650"/>
- Python is a general purpose language
- Researchers and practitioners add new functionalities all the time
- New features are included as libraries on top of the "basic" installation


# <span style="color:darkblue"> STEP 0: Preliminaries </span>

<font size="5"> 

- A Virtual Environment is an isolated **directory/workspace** (folder in your computer) <br>
that contains a specific **collection of packages**

- A package is a folder containing a set of Python scripts or <br>
modules which allow you to accomplish a defined task <br> 
(visualization, analysis, mathematical operations, etc.)
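
One way to check whether a package is available in the current environment is to ask Python to locate it. The sketch below is only an illustration: the helper name `is_installed` and the fake package name are made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named package or module."""
    return importlib.util.find_spec(package_name) is not None


# "not_a_real_package_xyz" is a made-up name that should not exist anywhere
for name in ["math", "statistics", "pandas", "not_a_real_package_xyz"]:
    print(name, "->", is_installed(name))
```

If a package you need shows `False`, it probably has to be installed in the environment first.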

<font size = "5">

To manage packages, open "Anaconda Navigator" on your <br>
computer and go to the "Environments" tab

<img src="figures/anaconda_navigator_installed.png" alt="drawing" width="650"/>

<font size = "5">

In the future, as your data analysis needs expand, <br>
you may want to click on the "Not installed" packages <br>
to download cool new packages!

<img src="figures/anaconda_navigator_notinstalled.png" alt="drawing" width="650"/>

<font size = "5">

Note: Anaconda does not list the default packages in <br>
Python's **standard** library. You have access to these too!<br>
[The Python Standard Library](https://docs.python.org/3/library/index.html) documentation lists what's included.
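
As a small illustration, the sketch below uses two standard-library modules, `collections` and `pathlib`, which need no installation. The list of values is made up for this example.

```python
from collections import Counter
from pathlib import Path

# Made-up values, for illustration only
cylinders = [4, 4, 6, 8, 4, 6]

# Counter tallies how many times each value appears
counts = Counter(cylinders)
print(counts[4])  # 3

# Path.cwd() shows the folder Python is currently working in
print(Path.cwd())
```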

# <span style="color:darkblue"> STEP 1: Setup Working Environment </span>

<font size="5"> 

(a) Double-check that Python is linked to VS Code

<img src="figures/python_kernel.png" alt="drawing" width="650"/>

- If not already linked, it will say "Select Kernel"
- Click button, choose "Python Environments", then select <br>
the version of Python that contains the word "anaconda"


<font size="5"> 

(b) Try some basic Python commands
- use "print" to display a message and some basic calculations

In [2]:
print("Hello World!")
print(2 + 3)
print(3*4)
print(2**3) # 2 raised to the third power

Hello World!
5
12
8


### Let's try to compute $\log_2(8)$ (which equals 3). 
### We will get our first error message!

In [3]:
print(log2(8))

NameError: name 'log2' is not defined

<font size = "5">

(c) Import Packages (a.k.a. libraries):

- Jupyter notebooks launch with only basic functionality
- The "import" command adds libraries to the working environment.
- Once imported, use "." to run subcommands contained in the library

In [4]:
import math
print(math.log2(8))

3.0
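
Python also lets you import individual functions from a library, so you can call them without the library name and the ".". A minimal sketch of both styles, using only the `math` library:

```python
import math
from math import log2, sqrt

# Style 1: library name, then ".", then the function
print(math.log2(8))  # 3.0

# Style 2: functions imported by name can be called directly
print(log2(8))   # 3.0
print(sqrt(16))  # 4.0
```

Both styles run the same function. The first makes it easier to see which library a command comes from.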


### Let's import the "statistics" library

In [5]:
import statistics

print(statistics.mean([1, 7, 3, 5, 9, 1, 12]))
print(statistics.median([1, 7, 3, 5, 9, 1, 12]))
print(statistics.geometric_mean([1, 7, 3, 5, 9, 1, 12]))

5.428571428571429
5
3.795163026589841


In [6]:
math.log2(16)

4.0

<font size = "5">

(d) Import Packages with nicknames:

- Typing "statistics" every single time you use that package can be a pain
- To create nice plots, we will be using `matplotlib.pyplot` - long name!
- Luckily, we can give the libraries a nickname with "as"
- We will also use `pandas` - library for working with datasets

In [7]:
# "stats" has not been defined as a nickname yet, so this line raises an error
stats.mean([1, 2, 3])

NameError: name 'stats' is not defined

In [None]:
# This cell is a code cell. But adding "#" at the start of a line makes it a comment.
# Comments are ignored by Python, but are useful for humans to understand the code.

# Notes about nicknames:
# - Import "statistics", but give it the nickname "stats"
# - "matplotlib.pyplot" is a long name. Let's call it "plt"
# - Similarly, let's give "pandas" the nickname "pd"
# - Try adding your own nickname!
# - To avoid errors, be consistent with your nicknames

import statistics as stats
import matplotlib.pyplot as plt
import pandas as pd


print(stats.mean([1, 7, 3, 5, 9, 1, 12]))

5.428571428571429
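
A nickname is just a second name for the same library, so both names give the same results. A short sketch to illustrate this:

```python
import statistics
import statistics as stats

# Both names point to the very same library
print(stats is statistics)  # True

values = [1, 7, 3, 5, 9, 1, 12]
print(stats.median(values) == statistics.median(values))  # True
```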


<font size="5"> 

(e) Open datasets

Run the command "read_csv" from the library <br>
"pandas" (nicknamed "pd"). 


In [9]:

# The subcommand "read_csv()" opens the file in parentheses.
# We use the "=" symbol to store the dataset in the working environment under the name "carfeatures"

carfeatures = pd.read_csv('data/features.csv')
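
To see what "read_csv" does without needing a file, the sketch below reads a tiny CSV typed directly into the code. The values and the name `example` are made up for this illustration and are not the contents of `data/features.csv`.

```python
import io

import pandas as pd

# A tiny made-up CSV: the first line holds the column names
csv_text = """mpg,cylinders,weight
18.0,8,3504
24.0,4,2372
21.0,6,2587
"""

# io.StringIO lets read_csv treat the text as if it were a file
example = pd.read_csv(io.StringIO(csv_text))

print(example.shape)          # (3, 3): three rows, three columns
print(list(example.columns))  # ['mpg', 'cylinders', 'weight']
print(example.head())         # shows the first rows of the dataset
```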

<font size="5"> 

You can view the datasets stored in the current environment
- Click on the "Jupyter Variables" button in the top bar to open a panel

<img src="figures/topbar.png" alt="drawing" width="650"/>

- Click on the icon to the left of "carfeatures" in the "Jupyter: Variables" tab

<img src="figures/jupyter_var.png" alt="drawing" width="700"/>

Data Wrangler will open a window showing the data
- Each row is an observation (a car)
- Each column is the value of a variable (a feature of that car)

***


# <span style="color:darkblue"> STEP 2: Run Analyses </span>

<font size="5"> 

Output data for all the columns

In [None]:
# Entering the name of a dataframe produces an output with some rows

carfeatures

<font size="5"> 

Output data for a single column 'cylinders'

In [None]:
# We use square brackets [...] to subset information from data 
# Text/strings have to be written in quotation marks
# This command extracts the column 'cylinders'

carfeatures["cylinders"]
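
Square brackets can also extract several columns at once if you pass a list of names. The sketch below uses a small made-up dataset named `example` rather than the course file.

```python
import pandas as pd

# Made-up data, for illustration only
example = pd.DataFrame({
    "mpg": [18.0, 24.0, 21.0],
    "cylinders": [8, 4, 6],
    "weight": [3504, 2372, 2587],
})

# One column name in quotation marks returns a single column (a "Series")
one_column = example["cylinders"]
print(type(one_column).__name__)  # Series

# A list of names (note the double brackets) returns a smaller dataset
two_columns = example[["cylinders", "mpg"]]
print(type(two_columns).__name__)  # DataFrame
print(two_columns.shape)           # (3, 2)
```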


<font size="5"> 

Example: Compute a frequency table

In [None]:
# crosstab counts how many rows fall into categories
# "index" is the category
# "columns" is a custom title

table = pd.crosstab(index = carfeatures['cylinders'], columns = "count")
table


### The "help" function can be used to learn more about a command

In [None]:
help(pd.crosstab)

In [None]:
# It looks like the first two arguments are "index" and "columns", which are required. The rest are optional.
# So we can also write the command like this:

table = pd.crosstab(carfeatures['cylinders'], "count")
table
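
pandas offers another way to count categories: the `value_counts()` subcommand of a column. The sketch below compares it with `crosstab` on made-up data. It is an alternative shown for illustration, not a replacement for the command above.

```python
import pandas as pd

# Made-up data, for illustration only
example = pd.DataFrame({"cylinders": [4, 4, 6, 8, 4, 6]})

# value_counts() counts rows per category; sort_index() orders by category
counts = example["cylinders"].value_counts().sort_index()
print(counts)

# The same counts using crosstab
table = pd.crosstab(example["cylinders"], "count")
print(table)
```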



### We can also cross-tabulate between two columns of carfeatures:

In [None]:
table_2 = pd.crosstab(index = carfeatures['cylinders'], columns = carfeatures['mpg'])
table_2

<font size="5"> 

Example: Compute basic summary statistics for all variables

In [None]:
# "describe" computes the count, mean, std, min, 25% quantile, 50%, 75%, max
# automatically excludes variables with text values
# otherwise includes all numeric variables

carfeatures.describe()
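
The sketch below shows on made-up data how "describe" skips text columns by default, and how the `include="all"` option asks it to summarize every column. The column names here are only for illustration.

```python
import pandas as pd

# Made-up data with two numeric columns and one text column
example = pd.DataFrame({
    "mpg": [18.0, 24.0, 21.0],
    "weight": [3504, 2372, 2587],
    "origin": ["USA", "Japan", "USA"],
})

# By default only the numeric columns are summarized
numeric_summary = example.describe()
print(list(numeric_summary.columns))  # ['mpg', 'weight']

# include="all" also reports on text columns (count, unique values, etc.)
full_summary = example.describe(include="all")
print(list(full_summary.columns))     # ['mpg', 'weight', 'origin']
```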

<font size="5"> 

Example: Display a scatter plot 

In [None]:
plt.scatter(x = carfeatures['weight'], y = carfeatures['mpg'])
plt.show()

### Q: Is this a good plot?

In [None]:
# Try another scatter plot with x = "acceleration"






# <span style="color:darkblue"> Pro Tips: How to be a great student for QTM 151? </span>

<font size="5"> 

- Ask clarifying questions, e.g.

    -  Can you explain what this command is doing? --> **I don't mind repeating an explanation!**
    -  What are the arguments of this function?
    -  What is the output?
    -  I get an error saying .... (be explicit), what could be the issue?

<font size="5"> 

- Remember that good coders ...

    -  build up their toolkit of commands over time
    -  understand that errors are normal the first time you run a command
    -  **learn to use websites to interpret errors!!**, e.g. https://stackoverflow.com/questions/tagged/python
    -  search help pages to find proper syntax, e.g. https://www.w3schools.com/python/
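
When you look up an error, the two most useful pieces are the error type and its message. The sketch below triggers an error on purpose with a misspelled command ("pritn" instead of "print") and captures both pieces. It uses "try" and "except", which this lecture has not covered, so treat it as a preview.

```python
try:
    # "pritn" is a deliberate typo, so Python does not recognize the name
    pritn("Hello World!")
except NameError as error:
    error_type = type(error).__name__  # the kind of error
    error_message = str(error)         # the explanation Python gives
    print(error_type + ": " + error_message)
```

Copying the error type and message into a search engine, or into your question, makes it much easier for others to help.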


<font size="5"> 

- Experiment

    -  If we do analyses for variable "A", try it for "B"
    -  Search online how to do something extra, e.g. change the color of a scatter plot
    -  Try running the syntax deliberately wrong: helps you get more familiar with error messages
    -  Think long term: solving a puzzle today means you can reuse the code next time!
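
As one example of "something extra", the sketch below changes the color of a scatter plot and adds axis labels and a title. The numbers are made up for illustration, and the color name and label text are just one possible choice.

```python
import matplotlib.pyplot as plt

# Made-up data, for illustration only
weight = [3504, 2372, 2587, 4341, 2130]
mpg = [18.0, 24.0, 21.0, 14.0, 27.0]

# "color" changes the color of the points
plt.scatter(x = weight, y = mpg, color = "darkorange")

# Labels and a title tell the reader what the plot shows
plt.xlabel("Weight")
plt.ylabel("Miles per gallon")
plt.title("Fuel efficiency versus weight (made-up data)")
plt.show()
```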

<font size="5"> 

- Come to office hours

    -  Best time for a one-on-one!
    -  Good place to ask about topics not covered in the lecture
