# Welcome to Python 0: An Introduction to Data Science

Please sign in if you have not already. We use this data to improve our workshops!

### About UF DSI

We are a multi- and interdisciplinary student organization that is dedicated to promoting Data Science here at the University of Florida. We are partnered with the UF Informatics Institute, whose aim is to foster informatics research and education.

### What is Python?

Python is an easy-to-use and robust **Object-Oriented** programming language. A lot of new software applications are built with Python for this reason. It is used in many areas of computer science, such as software engineering, digital arts, cybersecurity, and of course Data Science!

This workshop will introduce you to the basics of Python and to Data Science and Visualization in Python. Given the breadth of the language, there are still many topics left for you to explore; here we teach you the essential skills.

## Variables and Types

#### Calculator

Python can be used as a calculator. <code>Shift+Enter</code> runs the code block, so you don't have to click Run every time.

In [31]:
# Addition and Subtraction


In [32]:
# Multiplication and Division


In [33]:
# Exponentiation
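
As a minimal sketch of the three calculator cells above (the numbers are arbitrary, and Python 3 division behavior is assumed):

```python
# Addition and subtraction
total = 2 + 3          # 5
difference = 10 - 4    # 6

# Multiplication and division
product = 6 * 7        # 42
quotient = 7 / 2       # 3.5 in Python 3 ("/" is true division)
floored = 7 // 2       # 3 (floor division drops the remainder)

# Exponentiation uses **, not ^
power = 2 ** 10        # 1024

print(total, difference, product, quotient, floored, power)
```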


Variables can be given alphanumeric names beginning with an underscore or letter.  Variable types do not have to be declared and are inferred at run time.

In [34]:
# Int


In [35]:
# Float

Strings can be declared with either single or double quotes.

In [36]:
# Strings

## Modules and Import
Files with a .py extension are known as Modules in Python.  Modules are used to store functions, variables, and class definitions.  

Modules, whether they come from the standard Python library (like <code>math</code>) or are installed separately, are made available to your program using the <code>import</code> statement.

In [37]:
# To use Math, we must import it


Whoops.  Importing the <code>math</code> module gives us access to all of its functions, but we must call them through the module name, like this:

In [38]:
# Whole.part


Alternatively, you can use the <code>from</code> keyword

In [39]:
# From with pi


Using the <code>from</code> statement we can import everything from the math module.  

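Disclaimer: many Pythonistas discourage doing this because it fills your namespace with names you did not choose, which can silently overwrite your own variables and makes it hard to tell where a function came from.  Just import what you need.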

In [40]:
# From ... *
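
The import styles described above can be sketched as follows. The specific calls (`sqrt`, `pi`) are just illustrative choices from the `math` module.

```python
# Style 1: import the whole module, then call functions as module.function
import math
root = math.sqrt(16)          # 4.0

# Style 2: import only the names you need and use them directly
from math import pi
circle_area = pi * 3 ** 2     # area of a circle with radius 3

# Style 3 (generally discouraged) would be: from math import *
# It makes every public name in math available without the "math." prefix.

print(root, circle_area)
```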


## Strings
As you may expect, Python has a powerful, full-featured string type.  

### Substrings
Substrings can be extracted from Python strings using bracket (slice) syntax

In [41]:
# Print 1


Python is a zero-indexed language.  Generally, whenever forming a range of values in Python, the first argument is inclusive whereas the second is not, i.e. <code>mystring[11:25]</code> returns the characters at indices 11 through 24.

You can omit the first or second argument

In [42]:
# Characters before 9th

In [12]:
# Characters after 27th

In [13]:
# Omitting start and end

Using negative values, you can count positions backwards

In [14]:
# Print almost last 4 characters
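
Here is a small sketch of the slicing rules above. The contents of `mystring` are made up for this example, so the indices differ from those in the cell comments.

```python
mystring = "Go Gators! Data Science at UF"

middle = mystring[11:23]    # indices 11 through 22 -> "Data Science"
start = mystring[:9]        # everything before index 9 -> "Go Gators"
end = mystring[27:]         # index 27 to the end -> "UF"
whole = mystring[:]         # omitting both gives a copy of the whole string
last_two = mystring[-2:]    # negative indices count from the end -> "UF"

print(middle, start, end, last_two, sep=" | ")
```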

### String Functions
Here are some more useful string functions
#### find

In [15]:
# Find "Gators"

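Looks like nothing was found.  <code>find</code> returns -1 when the substring does not occur in the string; otherwise it returns the index where the first match begins.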
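
A minimal sketch of both outcomes, using a made-up sentence:

```python
sentence = "Welcome to the Data Science workshop"

found_at = sentence.find("Data")      # index where the first match begins
missing = sentence.find("Gators")     # -1 because "Gators" does not occur

print(found_at, missing)
```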

#### lower and upper
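
As a quick illustration (the sample text is made up), `lower` and `upper` return a new string with the case changed and leave the original untouched:

```python
cheer = "Go Gators"

quiet = cheer.lower()    # "go gators"
loud = cheer.upper()     # "GO GATORS"

print(quiet, loud, cheer)
```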

#### split
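
A short sketch of `split`, using made-up strings. With no argument it splits on whitespace; with an argument it splits on that separator.

```python
words = "data science is fun".split()       # split on whitespace
fields = "rank,discipline,salary".split(",")  # split on commas

print(words)
print(fields)
```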

#### join

The <code>join</code> method is useful for building strings from lists or other iterables of strings.  Call <code>join</code> on the desired separator string and pass it the iterable.

In [73]:
# Join with spaces
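
A minimal sketch of `join` with made-up data. Note that every item must already be a string, so numbers are converted with `str` first.

```python
parts = ["Data", "Science", "Informatics"]

spaced = " ".join(parts)                       # "Data Science Informatics"
listed = ", ".join(parts)                      # "Data, Science, Informatics"
numbers = " ".join(str(n) for n in [1, 2, 3])  # "1 2 3"

print(spaced)
print(listed)
print(numbers)
```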

For more information on string functions:

https://docs.python.org/2/library/stdtypes.html#string-methods


## Lists
Python's built-in types do not include traditional C-style fixed-size, fixed-type arrays.  Instead, lists are used, and a single list can contain a mix of any types.

Lists are created with square brackets []

In [17]:
# mylist list of 5

In [18]:
# append 6

In [20]:
# insert the number 7 at index 6


In [21]:
# removes the first matching occurrence


In [22]:
# by default, the last item in the list is removed and returned


In [24]:
# len()

In [26]:
# default list sorting. When more complex objects are in the list, arguments can be used to customize how to sort


In [27]:
# reverse the list
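
The list operations in the cells above can be sketched as one sequence. The values are illustrative, and the comments track the list after each step.

```python
mylist = [1, 2, 3, 4, 5]

mylist.append(6)         # [1, 2, 3, 4, 5, 6]
mylist.insert(6, 7)      # insert 7 at index 6 -> [1, 2, 3, 4, 5, 6, 7]
mylist.remove(3)         # first matching 3 removed -> [1, 2, 4, 5, 6, 7]
last = mylist.pop()      # removes and returns the last item (7)
length = len(mylist)     # 5 items remain

scores = [3, 1, 2]
scores.sort()            # default ascending sort, in place -> [1, 2, 3]

# A key function customizes the ordering; here words are sorted by length
by_length = sorted(["banana", "fig", "pear"], key=len)

mylist.reverse()         # in place -> [6, 5, 4, 2, 1]

print(mylist, last, length, scores, by_length)
```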


For more information on Lists:

https://docs.python.org/2/tutorial/datastructures.html#more-on-lists

## Conditionals
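Python supports the standard if / else-if / else conditional statement, spelled <code>if</code>, <code>elif</code>, and <code>else</code>. Remember to indent: Python uses indentation to mark the body of each branch.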
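
A minimal sketch of a conditional, with made-up thresholds. Only the first branch whose condition is true runs.

```python
temperature = 85

if temperature >= 90:
    label = "hot"
elif temperature >= 70:
    label = "warm"      # this branch runs for 85
else:
    label = "cool"

print(label)
```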

## Loops
Python supports for, foreach, and while loops
### For (counting)
Traditional counting loops are accomplished in Python with a combination of the <code>for</code> keyword and the <code>range</code> function

In [43]:
#with one argument, range produces integers from 0 to 9

In [45]:
# with three arguments (start, stop, step), range starts at 1 and goes in steps of 3, stopping before it reaches 12
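
The two counting loops described in the cells above might look like this. The arguments `10` and `(1, 12, 3)` are inferred from the cell comments, and the values are collected into lists so they are easy to inspect.

```python
# One argument: counts from 0 up to (but not including) 10
single = []
for i in range(10):
    single.append(i)

# Three arguments: start at 1, step by 3, stop before 12
stepped = []
for i in range(1, 12, 3):
    stepped.append(i)

print(single)
print(stepped)
```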

### Foreach
As it turns out, counting loops are just foreach loops in Python.  The <code>range</code> function produces a sequence of integers (a list in Python 2, a lazy <code>range</code> object in Python 3) over which <code>for in</code> iterates.  This can be extended to any other iterable type

In [49]:
# iterate over a list of strings
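
A short sketch of iterating directly over a list of strings (the list contents are made up). No index variable is needed.

```python
colleges = ["Engineering", "Liberal Arts", "Business"]

lengths = []
for college in colleges:
    print(college)
    lengths.append(len(college))   # record the length of each name

print(lengths)
```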

## Functions
Functions in Python do not have a distinction between those that do and do not return a value.  If a value is returned, the type is not declared.

Functions can be declared in any module without any distinction between static and non-static.  Functions can even be declared within other functions

The syntax is the following

In [50]:
# define function

In [51]:
# define player, name, number
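
As a sketch of the syntax, here is a hypothetical `describe_player` function taking a name and a number. The function name and its output format are assumptions made for illustration.

```python
def describe_player(name, number):
    """Return a one-line description of a player."""
    return f"{name} wears number {number}"

message = describe_player("Albert", 1)
print(message)
```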

Functions can have optional arguments if a default value is provided in the function signature

In [52]:

    
# no team argument supplied

In [53]:
# supplying all three arguments
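
A minimal sketch of an optional argument, again using a hypothetical `describe_player` function. The `team` parameter and its default value are assumptions for this example.

```python
def describe_player(name, number, team="Gators"):
    """team is optional because the signature gives it a default value."""
    return f"{name} wears number {number} for the {team}"

default_team = describe_player("Albert", 1)                     # no team supplied
explicit_team = describe_player("Alberta", 2, "Orange and Blue")  # all three supplied

print(default_team)
print(explicit_team)
```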

Python functions can be called using named arguments, instead of positional
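
For example, with a hypothetical `describe_player` function (an illustrative assumption), named arguments can be given in any order and produce the same result as the positional call:

```python
def describe_player(name, number, team="Gators"):
    return f"{name} wears number {number} for the {team}"

positional = describe_player("Albert", 1, "Gators")
named = describe_player(number=1, team="Gators", name="Albert")

print(positional == named)
```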

### return
In Python functions, an arbitrary number of values can be returned

In [57]:
# def sum, return a + b
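
A small sketch of returning more than one value, using a hypothetical `sum_and_product` function. Python packs the returned values into a tuple, which the caller can unpack into separate names.

```python
def sum_and_product(a, b):
    """Return two values at once; Python packs them into a tuple."""
    return a + b, a * b

result = sum_and_product(3, 4)            # the tuple (7, 12)
total, product = sum_and_product(3, 4)    # unpacked into two names

print(result, total, product)
```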

# Data Science Tutorial

Now that we've covered some Python basics, we will begin a tutorial going through many tasks a data scientist may perform.  We will obtain real world data and go through the process of auditing, analyzing, visualizing, and building classifiers from the data.

We will use a dataset of selected professor salaries, which can be found at this link:
https://vincentarelbundock.github.io/Rdatasets/csv/car/Salaries.csv

## Obtaining the Data
Using the pandas library, we can easily import data from a URL or from a file on our computer (given its file path). In this case we will give it a link.

In [1]:
#import pandas and load dataset into a frame
# import the module and alias it as pd

# show the first few rows of the data
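
The loading step can be sketched as below. So that it runs without network access, this sketch reads a few made-up rows from an in-memory string; the rows are not real salary data, and the column names other than `yrs.since.phd` and `salary` are assumptions about the file's layout. `pd.read_csv` accepts a URL string in exactly the same way.

```python
import io
import pandas as pd   # import the module and alias it as pd

# Made-up rows for illustration only (NOT the real dataset)
sample_csv = io.StringIO(
    "rank,discipline,yrs.since.phd,yrs.service,sex,salary\n"
    "Prof,B,19,18,Male,100000\n"
    "AsstProf,A,4,3,Female,80000\n"
    "AssocProf,B,12,8,Male,90000\n"
)
salary_data = pd.read_csv(sample_csv)

# With network access, pass the link instead:
# salary_data = pd.read_csv(
#     "https://vincentarelbundock.github.io/Rdatasets/csv/car/Salaries.csv")

# show the first few rows of the data
print(salary_data.head())
```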

Let's take a look at some simple statistics for the **yrs.since.phd** column

In [62]:
#describe salary column
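
As a sketch, `describe()` on a single column returns its count, mean, standard deviation, minimum, quartiles, and maximum. The frame below holds made-up values standing in for the real dataset; the same call works on the `salary` column.

```python
import pandas as pd

# Made-up values for illustration only
salary_data = pd.DataFrame({
    "yrs.since.phd": [5, 12, 20, 33],
    "salary": [80000, 95000, 120000, 150000],
})

# Select one column (a Series) and summarize it
phd_stats = salary_data["yrs.since.phd"].describe()
print(phd_stats)
```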

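<code>salary_data.mean().round()</code> will take the mean of each numeric column (this computation ignores any NaN values present), then round, and return a Series indexed by the column names of the original dataframe. In recent pandas versions, pass <code>numeric_only=True</code> to <code>mean</code> when the dataframe also contains text columns.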

This function can be used to replace all missing values with the mean of each column. In this tutorial however, we will not use this method, because the large number of missing values would greatly skew our standard deviations.

In [68]:
#find mean values for imputing
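
For reference, mean imputation can be sketched as follows. The frame is made up and deliberately contains missing values; `column_means` and `filled` are names chosen for this example.

```python
import pandas as pd

# Made-up values with gaps, for illustration only
salary_data = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "AssocProf"],
    "yrs.since.phd": [10, None, 20],
    "salary": [90000, 110000, None],
})

# Mean of each numeric column; NaN values are skipped
column_means = salary_data.mean(numeric_only=True).round()

# fillna with a Series fills each column using the matching label
filled = salary_data.fillna(column_means)

print(column_means)
print(filled)
```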

In [70]:
#check unique values

Structurally, Pandas dataframes are a collection of Series objects sharing a common index.  In general, the Series object and Dataframe object share a large number of functions with some behavioral differences.  In other words, whatever computation you can do on a single column can generally be applied to the entire dataframe.

Now we can use the dataframe version of <code>describe</code> to get an overview of all of our data

In [71]:
#overview description of data frame

## Visualizing the Data
Another important tool in the data scientist's toolbox is the ability to create visualizations from data.  Visualizing data is often the most logical place to start getting a deeper intuition of the data.  This intuition will shape and drive your analysis.

Even more important than visualizing data for your own personal benefit, it is often the job of the data scientist to use the data to tell a story.  Creating illustrative visuals that succinctly convey an idea is one of the best ways to tell that story, especially to stakeholders with less technical skillsets.

We'll be using the plotting library matplotlib, whose interface was modeled on MATLAB's plotting commands. It is one of the most widely used Python plotting libraries, and several other packages are built on top of it (such as seaborn) to make plots more attractive and easier to create.

We'll start by doing a bit of setup.

In [None]:
#importing matplotlib library with an alias as well as the seaborn library
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style = 'darkgrid', color_codes = True)   # my personal style preferences

# hack to make seaborn plots bigger on jupyter notebooks (don't worry about this)
def setPlt():
    f, ax = plt.subplots(figsize = (13,9))
    sns.despine(f, left = True, bottom = True)

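Let's go ahead and start with a histogram of the years since the professors got their Ph.D. using seaborn's <code>distplot()</code> function (deprecated in newer seaborn releases in favor of <code>histplot()</code> and <code>displot()</code>).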

In [None]:
#create our first plot, a histogram of salaries
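
A rough sketch of a histogram follows. To avoid depending on a particular seaborn version, it calls matplotlib's own `hist` rather than `distplot()`, and it uses made-up values in place of the real column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
salary_data = pd.DataFrame(
    {"yrs.since.phd": [2, 5, 8, 12, 15, 20, 22, 30, 35, 40]}
)

fig, ax = plt.subplots(figsize=(8, 5))
# hist returns the count in each bin and the bin edges
counts, bin_edges, _ = ax.hist(salary_data["yrs.since.phd"], bins=5)
ax.set_xlabel("Years since Ph.D.")
ax.set_ylabel("Number of professors")
```

In a Jupyter notebook the figure is displayed when the cell finishes; in a script, call `plt.show()`.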

Visualization is all about asking questions of the data. One thing that we could be curious about is how pay changes the longer people have had their Ph.D. We can make a scatterplot of exactly that using the scatter function.


Let's do a scatter plot of yrs.since.phd vs salary

In [None]:
#scatter plot

Seems like there are some people who have had their Ph.D. for a while but still don't get paid as much. Does the same hold true with how long they've worked?

In [None]:
#scatter plot with years of service

We can also color our graph fairly easily. Let's plot the years since Ph.D. against salary, colored by title, to see how salary is distributed across titles.

In [None]:
#colored scatter plot
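
One way to sketch a colored scatter plot is to draw each group separately so matplotlib assigns it its own color. The data below is made up, and the `rank` column name (holding each professor's title) is an assumption about the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
salary_data = pd.DataFrame({
    "rank": ["AsstProf", "AssocProf", "Prof", "Prof", "AsstProf", "AssocProf"],
    "yrs.since.phd": [3, 10, 25, 30, 5, 12],
    "salary": [78000, 92000, 125000, 140000, 80000, 95000],
})

fig, ax = plt.subplots(figsize=(8, 5))
# One scatter call per title; each call gets the next color in the cycle
for rank, group in salary_data.groupby("rank"):
    ax.scatter(group["yrs.since.phd"], group["salary"], label=rank)

ax.set_xlabel("Years since Ph.D.")
ax.set_ylabel("Salary")
ax.legend(title="Rank")
```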

## Summary

So far in our three-part Python series, we've learned about variables, data structures, functions, and graphing. While we have introduced these topics in the context of data science with Python, they are central to programming in any language and in any context. 


### Data Science in a Nutshell
We believe that data science has the potential to revolutionize the way we understand our world. Anyone can learn the tools of Data Science in order to ensure success. Our goal is to give you these tools and create a community of data scientists here at UF.

We hope you enjoyed the workshop and look forward to seeing you soon!

#### Visit our website if you want to get involved with DSI:  http://www.dsiufl.org

# Thank You!