# Welcome to Intro to Python

This tutorial has been provided to you by the Student Success Center at the Johns Hopkins Carey Business School. If you are unfamiliar with the Python programming language and would like to learn more about it, you are in the right place. If you stumbled upon this tutorial and are wondering whether you should go through it, you are also in the right place.

This is a short introduction to get you started running and writing Python code in less than 2 hours. You will not need to install anything right now, and you will have additional (free) resources to continue your learning journey beyond this lesson. If you have any questions or comments, please send them to ramonprz01@gmail.com and aneuen@jhu.edu.

This introduction assumes no prior knowledge of programming and only basic math skills acquired in high school.

## Why should you learn Python?

Python is currently one of the most popular programming languages for data analysis, web applications and websites, and artificial intelligence. Python is not only fun to learn and use but also practical, in the sense that it can help you advance your career without having to go back to university for another degree. Recent industry surveys have shown that [Python developers](https://neuvoo.com/salary/?job=Python+Developer) make on average \$120,000 and [data scientists using Python](https://neuvoo.com/salary/?job=Python+data+scientist) make around \$140,000, and the list does not stop there.


## Structure of the Tutorial

1. Introduction
    * What is Python?
    * What are the most used Python Libraries?
    * What is Data?
    * What is Data Analysis?
2. Variables and Data Types
3. Math in Python
4. Loading and Manipulating Data
5. Data Analysis
6. Conclusion
7. Keep improving
8. Resources


More information on how to continue your learning journey, plus additional resources, will be provided in section 7 of this tutorial.

### 1. Introduction

* __What is Python?__

As stated on [python.org](https://www.python.org/doc/essays/blurb/), "Python is an interpreted, object-oriented, high-level programming language with dynamic semantics." This is a more specific way of saying that Python is a general-purpose programming language that can be used for tasks ranging from building graphical user interface (GUI) applications to data analytics and machine learning. In fact, Python's popularity in recent years has been driven partly by its versatility and its usage in data analytics and machine learning. It is now one of the preferred programming languages (if not the preferred one) for data scientists, analysts, and researchers.

To run Python code throughout this tutorial, all you need to do is press __Shift-Enter__, and the cell you are in will run. Comments in Python are preceded by a __#__ sign.
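As a minimal sketch of what a cell might contain (the variable name `greeting` is made up for illustration), the lines or line endings that start with `#` are comments that Python ignores, while everything else is executed when you press Shift-Enter:

```python
# This whole line is a comment, so Python skips it.
greeting = "Hello, Python!"  # A comment can also follow code on the same line.

# print() displays the value on the screen.
print(greeting)
```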

As awesome as Python is, it cannot do everything on its own, or at least not with the base functionality it comes with. This is where the beauty of open source comes in. Because Python is free and [open source](https://opensource.com/resources/what-open-source), it has a vast number of analysts, researchers, and developers contributing to its development through [libraries](https://www.quora.com/What-is-a-Python-library-and-what-can-I-use-it-for) that make it easier for users to get the best out of Python. We will not go through all of Python's libraries in detail in this tutorial, as that would not be feasible, but we will briefly introduce some of the most useful ones for data analysis here.

* __What are some of the most widely used (and important) Python Libraries?__

Python has a plethora of libraries that are used by thousands of users each day. For data analysis, however, the most important ones are NumPy, pandas, scikit-learn, Matplotlib, and SciPy. Below you will find a short blurb, the command you use to import each, and the standard aliases used in industry (import "name of library or module" as "nickname").

Note: To load a library into your session you only need to run the cell once.

- _NumPy:_
NumPy is a scientific computing package that allows anyone to do fast computations with vectors and matrices and, more generally, a lot of linear algebra. The following code is what you would write and execute to import the library into a session.
```python
import numpy as np
```
- _pandas:_
pandas is a library that allows us to load, merge, and manipulate dataframes (think of these as Excel spreadsheets) before or while we analyze our data.
```python
import pandas as pd
```
- _scikit-learn:_
scikit-learn (imported as `sklearn`) is one of the most widely used libraries for machine learning. Because of its versatility and wide variety of modules (think of these as sub-libraries within scikit-learn), one can run a wide range of machine learning algorithms with this library.
```python
from sklearn import *  # The asterisk imports every public name the library exposes.
# Fine for exploring, though importing only what you need is usually clearer.
```
- _SciPy:_
The [scipy library](https://www.scipy.org/scipylib/index.html) "provides many user-friendly and efficient numerical routines such as routines for numerical integration, interpolation, optimization, linear algebra and statistics."

```python
from scipy import *
```

- _Matplotlib:_
As stated on matplotlib's [website](https://matplotlib.org/), "Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms."
```python
import matplotlib.pyplot as plt
```
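To see why the aliases are convenient, here is a small sketch that uses `np` and `pd` as prefixes once the libraries are imported. The numbers and the names `prices`, `average_price`, and `table` are invented for illustration and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# A NumPy array supports fast, element-wise math.
prices = np.array([10.0, 20.0, 30.0])
average_price = prices.mean()

# A pandas DataFrame stores columns of data, much like a spreadsheet.
table = pd.DataFrame({"price": prices, "with_tax": prices * 1.1})

print(average_price)
print(table.shape)  # (number of rows, number of columns)
```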

* __What is Data?__

<img src="data.jpg">

Computer Hope defines __Data__ in the following way: "data is any set of characters that is gathered and translated for some purpose, usually analysis. It can be any character, including text and numbers, pictures, sound, or video. If data is not put into context, it doesn't do anything to a human or computer."

In essence, everything we do, from breathing to searching the web, generates information in different formats, and all of it can be classified as data. It is important to note, however, that most of the data available in the world is unstructured, meaning it is not in a nice tabular format like the one we are used to in Excel. We'll dive into the different data types in the next section.

* __What is Data Analysis?__

A good definition of data analysis comes from [Wikipedia](https://en.wikipedia.org/wiki/Data_analysis): "Data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making." Think about it this way: whenever we try to make a decision using evidence, we need to put the evidence in a format in which we can first make sense of it, second analyze it, and third provide actionable insight with it. This is, in a way, the essence of data analysis and the focus of this tutorial.
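The steps in that definition can be sketched in a few lines of pandas. This is only an illustration under made-up assumptions: the buildings, prices, and column names below are hypothetical, and dropping incomplete rows is just one of several possible cleaning choices.

```python
import pandas as pd

# Hypothetical raw data: one price is missing (None).
raw = pd.DataFrame({
    "building": ["A", "B", "C", "D"],
    "price": [250000, None, 410000, 330000],
    "floor_area": [100, 80, 150, 120],
})

# 1. Inspect: how many values are missing in each column?
missing_counts = raw.isna().sum()

# 2. Clean: drop the rows that have missing values.
clean = raw.dropna()

# 3. Transform: add a price-per-unit-of-area column.
clean = clean.assign(price_per_area=clean["price"] / clean["floor_area"])

# 4. Summarize: a single number that can inform a decision.
average_price_per_area = clean["price_per_area"].mean()

print(missing_counts)
print(average_price_per_area)
```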


We will now go ahead and import the libraries we will be using in this tutorial.

```python
# Let's import the libraries we will use throughout this tutorial.

import numpy as np
import pandas as pd
```

#### Exercise 1
Can you import matplotlib below with its industry alias? We will need it later on.

### 2. Variables and Data Types

- __Variables__

Variables are the fuel of any data analysis, and these same variables can represent a multitude of data types (e.g. word characters "a, e, i, o, u", decimal numbers "10.65", integers "1, 2, 3", etc.). To paint a better picture of what variables are, think of an Excel spreadsheet with different characteristics of a building as the columns (e.g. floor area, address, price, # of rooms/floor, etc.) and a different building in each row. The columns in this scenario are the variables, and each value at the intersection of a row and a column is a data point showing a specific characteristic of a building. This is what we refer to as tabular data or, in other words, a dataframe. Groups of variables containing information can form a dataframe, and these variables can hold anywhere from no information to a lot of it. The picture below shows the example just mentioned.

<img src="pic1.jpg">

Say we want to create a variable in Python with just one data point:

```python
x = 5
```

Congrats! You just ran your first Python code.

The letter x will hold the value of 5 for us and keep it safe until we tell it to do something with it. Let's give x a friend named y.

```python
y = 7
```

To display the value of a variable, or of any other object available during the session, we will use the function `print()`.

```python
print(x, y)
```

```
5 7
```


An important point about variables and dataframes is that both are stored using the same convention: a name like x could just as well hold a dataframe or any other data type we assign to it. What you need to keep in mind is that when we are dealing with tabular data or a dataframe, a variable is any one of the columns of the dataframe, but when we are just manipulating different data types to do computations, a variable can be anything.
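The two meanings of "variable" can be seen side by side in a short sketch. The name `z` and the small building table are hypothetical and used only for illustration.

```python
import pandas as pd

# A name can hold a single number...
z = 5
first_type = type(z).__name__

# ...and the same name can later hold a whole dataframe.
z = pd.DataFrame({"floor_area": [100, 80], "price": [250000, 190000]})
second_type = type(z).__name__

# Inside a dataframe, each column is a variable in the data-analysis sense.
price_column = z["price"]

print(first_type, second_type)
print(price_column)
```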

Lastly, a variable, whether in a dataframe or in isolation, can be thought of as a one-dimensional vector, and a dataframe can be thought of as a combination of vectors, a multidimensional vector, or a matrix. We will see later on how to create, manipulate, and work with vectors and dataframes in Python.

Here are two ways of creating a vector, followed by a dataframe that combines both vectors.

```python
# First vector as a NumPy array. These are my friends' birthdays (day of the month).
vec1 = np.array([12, 21, 30, 8, 23])

# Second vector as a pandas Series (similar to an array, with an index). These are my friends.
vec2 = pd.Series(['Doc', 'Chloe', 'Juanky', 'Arelis', 'Trisha'])

# Here we combine both into a dataframe, a spreadsheet-like object.
# Each vector starts out as a row, so .T (transpose) flips rows into columns.
dataframe = pd.DataFrame(data=[vec1, vec2]).T

# We want the columns to have names, so we add them here.
dataframe.columns = ['birthday', 'friend']

# This line shows us our new dataframe, made of the two vectors we created ourselves.
print(dataframe)
```

```
  birthday  friend
0       12     Doc
1       21   Chloe
2       30  Juanky
3        8  Arelis
4       23  Trisha
```
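Another common way to build the same kind of table, shown here as a sketch rather than as the approach used in this tutorial, is to pass a dictionary to `pd.DataFrame`. The names `birthdays`, `friends`, `friends_df`, and `birthday_column` are chosen for illustration.

```python
import numpy as np
import pandas as pd

birthdays = np.array([12, 21, 30, 8, 23])
friends = ["Doc", "Chloe", "Juanky", "Arelis", "Trisha"]

# Each dictionary key becomes a column name; each value becomes that column.
friends_df = pd.DataFrame({"birthday": birthdays, "friend": friends})

# Selecting one column gives back a one-dimensional vector (a pandas Series).
birthday_column = friends_df["birthday"]

print(friends_df.shape)
print(birthday_column.max())
```

With this approach the column names are set at creation time, so no transpose or renaming step is needed.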


- __Data Types__

There are all sorts of data types that data analysts, scientists, and engineers use to represent the world around us. The most useful data types for our purposes will be integers, floats (e.g. a number with decimals or a fraction), booleans (e.g. True and False, or 1 and 0, respectively), strings (e.g. word characters), and time (e.g. month, year, hour, minutes, etc.). Another data type in Python, and probably one of the most important ones, is the list. While a variable in a dataframe can represent only one data type (e.g. an integer or a string), a list can hold multiple data types at once. Let's look at the first four separately.

```python
# This is an integer
type_1 = 0
print(type(type_1))
```

```
<class 'int'>
```


```python
# This is a float
type_2 = 1.75
print(type(type_2))
```

```
<class 'float'>
```


```python
# This is a boolean
type_3 = True
print(type(type_3))
```

```
<class 'bool'>
```


```python
# This is a string
type_4 = 'Hello World!'
print(type(type_4))
```

```
<class 'str'>
```
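Lists and time values were mentioned above but do not have their own cells, so here is a brief sketch of both. It assumes the standard-library `datetime` module as one way to represent time, and the date used is arbitrary.

```python
from datetime import datetime

# A list can mix several data types in one container.
mixed = [0, 1.75, True, "Hello World!"]
type_names = [type(item).__name__ for item in mixed]
print(type_names)

# Dates and times have their own type in the standard library.
moment = datetime(2020, 1, 15, 9, 30)
print(type(moment))
print(moment.year, moment.month, moment.hour)
```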


Remember, a vector or a column of a dataframe can hold only one data type. This is an essential fact to keep in mind, especially when we begin analyzing the data.
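One way to check this for yourself is to look at the `dtype` of a pandas Series. In the sketch below (values and names are made up), a column of whole numbers gets an integer type, while a column that mixes numbers and text is stored under pandas' single catch-all `object` type.

```python
import pandas as pd

# All integers: pandas stores the column with an integer type.
ages = pd.Series([12, 21, 30])

# Mixing numbers and text: pandas falls back to the catch-all "object" type.
mixed_column = pd.Series([12, "twenty-one", 30.5])

print(ages.dtype)
print(mixed_column.dtype)
```

A column with the `object` type is a common sign that the data needs cleaning before any numeric analysis.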

### 3. Math in Python

Python supports all kinds of calculations and data analyses, and in this section we'll go over the basic operations before diving deeper into data analysis.

```python
# Let us begin with PEMDAS (Parentheses, Exponents, Multiplication, Division, Addition, and Subtraction)
```
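A few small expressions illustrate the order of operations. The variable names are invented for this sketch; note that division with `/` always produces a float in Python 3.

```python
# Parentheses are evaluated first.
with_parentheses = (2 + 3) * 4      # 5 * 4

# Exponents (written with **) come before multiplication.
with_exponent = 2 * 3 ** 2          # 2 * 9

# Multiplication and division come before addition and subtraction.
mixed_operations = 10 - 4 / 2 + 1   # 10 - 2.0 + 1

print(with_parentheses, with_exponent, mixed_operations)
```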

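```python
# Assumption: x was reassigned to a NumPy array in a cell not shown here.
x = np.array([1, 2, 3])
print(x)
```

```
[1 2 3]
```

```python
print(type(x))
```

```
<class 'numpy.ndarray'>
```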
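Arithmetic on a NumPy array works on every element at once, which is what makes arrays handy for math. A minimal sketch, using a hypothetical array named `values`:

```python
import numpy as np

values = np.array([1, 2, 3])

# Arithmetic on an array is applied to every element at once.
doubled = values * 2
squared = values ** 2

# Arrays also come with summary methods.
total = values.sum()

print(doubled, squared, total)
```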


### 8. Resources

- To learn more about NumPy: https://numpy.org/
- To learn more about pandas