# Introduction to Data Science (CS4661). Cal State Univ. LA, CS Dept.
#### Instructor: Dr. Mohammad Pourhomayoun
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Data Science in Python

#### This is an introduction to some data science libraries/packages in Python. Feel free to refer to the suggested resources and documentation for more details.

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


## Let's start working with Python Libraries and Packages:
To use a Library/Package in python, we first need to import it:


In [None]:
# To use a Library/Package in python, we first need to import it:

# import ...
# import ... as ...
# from ... import ...
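
The three import forms listed above can be sketched with a concrete module. This is only an illustration: it uses the standard-library `math` module (not part of the original notebook) so that it runs without installing anything.

```python
# Three ways to reach the square-root function in the standard "math" module.
import math              # import the whole module; prefix names with "math."
import math as m         # import the whole module under a shorter alias
from math import sqrt    # import a single name directly into the current namespace

print(math.sqrt(16))     # 4.0
print(m.sqrt(16))        # 4.0
print(sqrt(16))          # 4.0
```

All three calls run the same function; the forms differ only in how the name is written at the call site.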

# Numpy Package:

In [None]:
# Numpy is a library for performing advanced mathematical operations on Arrays

import numpy as np  # from now on, we can use the abbreviation "np" to refer to numpy functions


In [None]:
avg = np.mean([1,3,5,4,2,6,7,1,-5,4,2])  # mean (average) using numpy function "mean"
print(avg)

st = np.std([1,3,5,4,2,6,7,1,-5,4,2])    # standard deviation using numpy function "std"
print(st)
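
One detail that is easy to miss: by default `np.std` divides by N (the population formula). The sketch below uses a small made-up list, chosen so the numbers come out round, to compare the default with the sample formula selected by `ddof=1`.

```python
import numpy as np

# Illustrative values only (not from the course dataset).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(data)                # arithmetic mean
pop_std = np.std(data)              # default ddof=0: divide by N (population formula)
sample_std = np.std(data, ddof=1)   # ddof=1: divide by N - 1 (sample formula)

print(mean)         # 5.0
print(pop_std)      # 2.0
print(sample_std)   # slightly larger than pop_std
```

Which formula is appropriate depends on whether the data is the whole population or a sample from it.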

##### We will talk more about numpy in the next tutorials!

# Pandas Package:
#### Pandas introduces two new data structures to Python: Series and DataFrame.
- A Series is a one-dimensional vector similar to an array, list, or column in a table.
- A DataFrame is a structured data table (or Matrix) comprised of rows and columns.

#### Pandas is a powerful library to read, manipulate, and process data in the form of Series and DataFrame structures. Pandas works well with large tabular datasets that fit in memory. Pandas DataFrames are similar in concept to "R" data frames.

In [None]:
# The first thing to do is to import the pandas library:
# Pandas is a powerful library to read and process data tables

import pandas as pd

### Creating a Series from an arbitrary python list

In [None]:
# creating a Series from an arbitrary python list
s = pd.Series([1,20,32,12,-6.5])
print(s)

In [None]:
print(s[3])
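
A short sketch of what `s[3]` is doing: a Series carries an index of labels, and with no explicit index the labels are 0, 1, 2, ... The second Series `t` and its letter labels are hypothetical, added only to show label-based access.

```python
import pandas as pd

s = pd.Series([1, 20, 32, 12, -6.5])

# One non-integer value makes pandas store the whole Series as floats.
print(s.dtype)      # float64

# With the default index, the label 3 refers to the fourth value.
print(s[3])         # 12.0
print(s.iloc[3])    # 12.0, selected by position rather than by label

# With custom labels (hypothetical names), access by label uses those instead.
t = pd.Series([1, 20, 32, 12, -6.5], index=['a', 'b', 'c', 'd', 'e'])
print(t['d'])       # 12.0
```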

### Creating a DataFrame

In [None]:
# creating an empty DataFrame:
df = pd.DataFrame()

# Adding new columns:
df['first_name'] = ['Tom','Ryan','Jack','Sarah']
df['Gender'] = ['male','male','male','female']

print(df)

In [None]:
# Adding new columns:
df['age'] = [45,34,32,43]

print(df)

### Creating a DataFrame from an arbitrary python dictionary

In [None]:
# creating a DataFrame from an arbitrary python dictionary

dic = {'first_name':['Tom','Ryan','Jack','Sarah'],'Gender':['male','male','male','female']}

df = pd.DataFrame(dic)

print(df)

In [None]:
# creating a DataFrame from a dictionary and defining the columns and order of them:

df = pd.DataFrame(dic,columns=['first_name','Gender'])

print(df)
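
The `columns` argument does more than fix the order. As a small sketch, the call below reorders the two columns and also asks for an `age` column that is not in the dictionary (a hypothetical name used only for this illustration); pandas creates it and fills it with missing values.

```python
import pandas as pd

dic = {'first_name': ['Tom', 'Ryan', 'Jack', 'Sarah'],
       'Gender': ['male', 'male', 'male', 'female']}

# Reorder the columns and request one that the dictionary does not contain.
df_demo = pd.DataFrame(dic, columns=['Gender', 'first_name', 'age'])

print(df_demo)
print(list(df_demo.columns))         # ['Gender', 'first_name', 'age']
print(df_demo['age'].isna().all())   # True: every value in 'age' is missing
```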

### Reading/Loading iris dataset from the web/local_device and storing it in a DataFrame:


In [None]:
# loading a CSV file from a local device and storing it in a pandas DataFrame
# (replace the path below with the location of iris.csv on your own machine):
# "read_csv" is a pandas function to read csv files from web or local device:

df = pd.read_csv('/Users/mpourho/Documents/CSU/Courses/CS4661/Datasets/iris.csv')


In [None]:
# reading a CSV file directly from the web and storing it in a pandas DataFrame:
# "read_csv" is a pandas function to read csv files from web or local device:

df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv')
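
To try `read_csv` without a file or a network connection, the same function can read from an in-memory text buffer. This is a minimal sketch: the three rows are illustrative stand-ins typed by hand, not the downloaded dataset, and `df_small` is a name chosen for this example.

```python
import io
import pandas as pd

# A tiny CSV typed inline; the first line is the header row.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)    # (3, 5)
print(df_small.dtypes)   # numeric columns are parsed as floats
```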


### DataFrame is a table in pandas with rows and columns to store the dataset

In [None]:
# displaying the DataFrame:

df  # you can also use print(df)

In [None]:
# display the first 5 rows (default) of Dataframe data:

df.head()

In [None]:
# display the first 7 rows of DataFrame data:

df.head(7)

In [None]:
# display the last 5 rows of Dataframe data:

df.tail()

In [None]:
# check the shape of the DataFrame (rows, columns):

df.shape

In [None]:
# returns the list (names) of columns:

col = df.columns
print (col)

### Working with columns of a DataFrame:

In [None]:
# Selecting a column of a DataFrame:
# [] returns a column as a series

df['sepal_length']

In [None]:
# Selecting several columns of a DataFrame:
# we can use [[]] in one line (notice the double brackets!)
# [[]] returns column(s) as a new DataFrame

df[['sepal_length','petal_length']]
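
The difference between `[]` and `[[]]` shows up in the type and shape of the result. The sketch below uses a two-row stand-in table (`df_demo`, with made-up values) so that it runs on its own.

```python
import pandas as pd

# Small stand-in table with illustrative values.
df_demo = pd.DataFrame({'sepal_length': [5.1, 4.9],
                        'petal_length': [1.4, 1.4]})

as_series = df_demo['sepal_length']     # single brackets: a Series
as_frame = df_demo[['sepal_length']]    # double brackets: a one-column DataFrame

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(as_series.shape)            # (2,)
print(as_frame.shape)             # (2, 1)
```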


### Working with rows of a DataFrame:

In [None]:
dic = {'first_name':['Tom','Ryan','Jack','Sarah'],'Gender':['male','male','male','female']}
df = pd.DataFrame(dic)
print(df)
print('\n')

print(df[0:2]) # rows 0 and 1
print('\n')

print(df[-1:]) # the last row
print('\n')

print(df[2:4]) # rows 2-3
print('\n')

print(df[:-1]) # all but the last row
print('\n')
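
Slicing a DataFrame with `[start:stop]` selects rows by position, much like slicing a Python list. As a sketch, the comparison below shows that it gives the same result as the explicit position-based selector `iloc`; `df_people` is a name chosen for this example.

```python
import pandas as pd

df_people = pd.DataFrame({'first_name': ['Tom', 'Ryan', 'Jack', 'Sarah'],
                          'Gender': ['male', 'male', 'male', 'female']})

first_two = df_people[0:2]           # rows at positions 0 and 1
same_rows = df_people.iloc[0:2]      # the same selection written with iloc

print(first_two.equals(same_rows))               # True
print(df_people[-1:]['first_name'].tolist())     # ['Sarah']
print(df_people[:-1]['first_name'].tolist())     # ['Tom', 'Ryan', 'Jack']
```

Note that a single integer such as `df_people[0]` is not a row selection: pandas looks for a column named `0` and raises a `KeyError` here.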

### Working with cells of a DataFrame:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv')

# Selecting a cell by row and column names:

print(df.loc[3,'sepal_length'])

In [None]:
# Selecting and slicing by row and column names:

df.loc[3:7,['sepal_length','sepal_width']]

In [None]:
# Selecting and slicing by position in the table:

df.iloc[60:75,0:3]
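
One difference between `loc` and `iloc` is worth seeing side by side: a `loc` slice includes its end label, while an `iloc` slice excludes its end position. A minimal sketch on a made-up ten-row table (`df_demo`):

```python
import pandas as pd

# Ten rows labeled 0..9 by the default index; values are illustrative.
df_demo = pd.DataFrame({'value': range(10, 20)})

by_label = df_demo.loc[3:7]       # labels 3, 4, 5, 6, 7 (end included)
by_position = df_demo.iloc[3:7]   # positions 3, 4, 5, 6 (end excluded)

print(len(by_label))                      # 5
print(len(by_position))                   # 4
print(by_label['value'].tolist())         # [13, 14, 15, 16, 17]
print(by_position['value'].tolist())      # [13, 14, 15, 16]
```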

### Filtering the rows based on a condition on column(s):

In [None]:
# Selecting only the rows for which sepal_length > 7.5

df[df['sepal_length'] > 7.5]

In [None]:
df[df['sepal_length'] == 6.5]

In [None]:
df.loc[df['sepal_length'] == 6.5,['sepal_length','sepal_width']]

In [None]:
# Selecting the rows based on conditions on multiple columns:

df[(df['sepal_length'] > 7.5) & (df['petal_length'] < 6.7)]
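
Filtering works in two steps: the comparison produces a Series of True/False values (a mask), and indexing with that mask keeps only the True rows. The sketch below uses a four-row table with made-up values to show the mask itself and the `&` (and), `|` (or), and `~` (not) operators.

```python
import pandas as pd

# Illustrative values only.
df_demo = pd.DataFrame({'sepal_length': [7.7, 7.6, 6.5, 7.9],
                        'petal_length': [6.9, 6.6, 4.6, 6.4]})

mask = df_demo['sepal_length'] > 7.5
print(mask.tolist())                  # [True, True, False, True]

# Each condition needs its own parentheses, because & and | bind tighter than > and <.
both = df_demo[(df_demo['sepal_length'] > 7.5) & (df_demo['petal_length'] < 6.7)]
either = df_demo[(df_demo['sepal_length'] > 7.5) | (df_demo['petal_length'] < 6.7)]
negated = df_demo[~mask]

print(both.index.tolist())            # [1, 3]
print(either.index.tolist())          # [0, 1, 2, 3]
print(negated.index.tolist())         # [2]
```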

In [None]:
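# applying a function on one specific column of a DataFrame:

# defining a function:
def myInc(a):
    # return a new value instead of modifying the argument in place
    return a + 1


# applying the above function on one column of the DataFrame:
# (the whole column is passed in at once, so the addition is done element-wise)

df['sepal_length'] = myInc(df['sepal_length'])

df.head()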

In [None]:
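# An alternative using the method "apply":

# "apply" calls the above function once for each element of the column.
# Note: if the previous cell has already been run, this increments the column a second time.

df['sepal_length'] = df['sepal_length'].apply(myInc)

df.head()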
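
As a sketch of how these approaches relate, the snippet below adds 1 to a small made-up Series in three ways and checks that the results match. The function name `my_inc` is a stand-in for this example.

```python
import pandas as pd

def my_inc(a):
    # Works for a single number and for a whole Series alike.
    return a + 1

col = pd.Series([5.1, 4.9, 4.7])   # illustrative values

vectorized = col + 1               # arithmetic on the whole column at once
via_function = my_inc(col)         # the function receives the whole Series
via_apply = col.apply(my_inc)      # the function is called once per element

print(vectorized.equals(via_function))   # True
print(vectorized.equals(via_apply))      # True
print(col.tolist())                      # [5.1, 4.9, 4.7], the original is unchanged
```

For simple arithmetic the whole-column form is usually the faster choice; `apply` is useful when the function cannot operate on a full Series at once.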

### It is common in sklearn to store the features in a table called the Feature Matrix, and the labels in a separate vector called the Label Vector or Response Vector.

In [None]:
# Creating the Feature Matrix for iris dataset:

# create a python list of the feature names that we would like to pick from the dataset:
feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']

# use the above list to select the features from the original DataFrame
X = df[feature_cols]  

# print the first 5 rows
X.head()

In [None]:
# checking the size of Feature Matrix X:
print(X.shape)

In [None]:
# Creating the Feature Matrix for iris dataset:

# equivalently we can use [[]] in one line (notice the double brackets!)
X = df[['sepal_length','sepal_width','petal_length','petal_width']]

# print the first 5 rows
X.head()

In [None]:
# select a Series of labels (the last column) from the DataFrame
y = df['species']

# print the first 5 values
y.head()
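
A quick sanity check after splitting: the Feature Matrix should be two-dimensional, the Label Vector one-dimensional, and both should have the same number of rows. The sketch below uses a three-row stand-in table (`df_small`, hand-typed illustrative values) instead of the full dataset.

```python
import pandas as pd

# Three-row stand-in for the iris table; values are illustrative.
df_small = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'sepal_width':  [3.5, 3.2, 3.3],
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width':  [0.2, 1.4, 2.5],
    'species':      ['setosa', 'versicolor', 'virginica'],
})

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

X = df_small[feature_cols]    # Feature Matrix: one row per sample, one column per feature
y = df_small['species']       # Label Vector: one label per sample

print(X.shape)                # (3, 4)
print(y.shape)                # (3,)
print(len(X) == len(y))       # True

# The same data as a plain NumPy array, if one is needed.
X_array = X.to_numpy()
print(X_array.shape)          # (3, 4)
```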


#### We will talk more about pandas in the next tutorials!



   








