Pandas is a Python library used in data analytics to extract insights from large amounts of data.

<h1>Dataframe Basics</h1>
<h3>Topics</h3>
<ul>
    <li>Creating Dataframe</li>
    <li>Dealing with rows and columns</li>
    <li>Operations: min, max, std, describe</li>
    <li>Conditional Selection</li>
    <li>set_index</li>
</ul>
<p>Dataframes are objects (with methods and attributes) that are used to represent tabular data, e.g. Excel files, CSVs, SQL tables, etc.</p>

In [25]:
import pandas as pd
# Import data into a dataframe
df = pd.read_csv("weather_data_nyc_centralpark_2016(1).csv") # Pass the name (path) of the CSV file to pd.read_csv

# Dataframe from a Python dictionary, where the keys are column names and the values are the column values
dict1 = {
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny']
}
df2 = pd.DataFrame(dict1)
print(df2.shape) # Number of rows and columns as a (rows, columns) tuple
print(df2.head())  # First 5 rows by default; df2.head(x) prints the first x rows, df2.tail(x) the last x rows
print(df2.columns) # Column labels of the dataframe

# Printing shape, first or last x rows, columns

(6, 4)
        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain
Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')
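
As a small follow-up to the cell above, here is a sketch of head(x) and tail(x) with an explicit row count. It rebuilds the same toy data under the assumed name weather, so it runs on its own.

```python
import pandas as pd

# Same toy weather data as above; the name `weather` is only used for this sketch
weather = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

rows, cols = weather.shape      # shape is a (rows, columns) tuple
first_two = weather.head(2)     # first 2 rows
last_two = weather.tail(2)      # last 2 rows

print(rows, cols)
print(first_two)
print(last_two)
print(list(weather.columns))    # column labels as a plain Python list
```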


In [10]:
# Indexing dataframe
print(df2[2:5])  # Returns a new dataframe with the rows at positions 2 to 4 (the 3rd to 5th rows, all columns)
print(df2['day']) # Returns a Series containing the data in that column
# To get a specific item, index the column and then the row, e.g.
print(df2['day'][0])
# To get multiple columns, index with a list of column names
print(df2[['day', 'event', 'temperature']]) # Returns a new dataframe with only the specified columns

        day  temperature  windspeed event
2  1/3/2017           28          2  Snow
3  1/4/2017           24          7  Snow
4  1/5/2017           32          4  Rain
0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object
1/1/2017
        day  event  temperature
0  1/1/2017   Rain           32
1  1/2/2017  Sunny           35
2  1/3/2017   Snow           28
3  1/4/2017   Snow           24
4  1/5/2017   Rain           32
5  1/6/2017  Sunny           31
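
The same selections can also be written with .loc (label based) and .iloc (position based), which avoids chaining two [] lookups. This is a minimal sketch on the same toy data, not the only way to do it; the variable names are made up for the example.

```python
import pandas as pd

df2 = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

item = df2.loc[0, 'day']                 # single value: row label 0, column 'day'
by_position = df2.iloc[2:5]              # rows at positions 2, 3 and 4 (end is exclusive)
subset = df2.loc[2:4, ['day', 'event']]  # label slice 2..4 (end is inclusive), two columns

print(item)
print(by_position)
print(subset)
```

Note that the slice in .iloc excludes the end position, while the label slice in .loc includes the end label.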


In [17]:
# Operations
print(df2['temperature'].max())  # Largest value in the column
# Similarly: .mean() for the mean, .std() for the standard deviation, .min() for the smallest value
print(df2.describe()) # Summary statistics for the numeric columns
# You can also index with a condition
print(df2[df2.temperature > 32]) # -> Select all columns where temperature > 32
print(df2[['day', 'temperature']][df2.temperature >= 32]) # -> SELECT day, temperature FROM df2 WHERE temperature >= 32

35
       temperature  windspeed
count     6.000000   6.000000
mean     30.333333   4.666667
std       3.829708   2.338090
min      24.000000   2.000000
25%      28.750000   2.500000
50%      31.500000   5.000000
75%      32.000000   6.750000
max      35.000000   7.000000
        day  temperature  windspeed  event
1  1/2/2017           35          7  Sunny
        day  temperature
0  1/1/2017           32
1  1/2/2017           35
4  1/5/2017           32
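
Conditions can be combined with & (and) and | (or), and .loc takes the condition and the column list in one step. The sketch below assumes the same toy data; the result names are only illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# Row condition and column selection in a single .loc call
warm_days = df2.loc[df2['temperature'] >= 32, ['day', 'temperature']]

# Each condition needs its own parentheses when combined with & or |
windy_snow = df2[(df2['event'] == 'Snow') & (df2['windspeed'] > 5)]

# Row(s) where the temperature equals the column maximum
hottest = df2[df2['temperature'] == df2['temperature'].max()]

print(warm_days)
print(windy_snow)
print(hottest)
```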


In [27]:
# Change the index of the dataframe
df2.set_index('day', inplace=True)  # Modifies df2 in place, making the 'day' column the index
df2.loc['1/1/2017'] # Gets the row whose index label is '1/1/2017'

temperature      32
windspeed         6
event          Rain
Name: 1/1/2017, dtype: object
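
Without inplace=True, set_index returns a new dataframe and leaves the original untouched, and reset_index turns the index back into a regular column. A short sketch, assuming the same toy data (cut to three rows) and made-up variable names:

```python
import pandas as pd

df2 = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'temperature': [32, 35, 28],
    'windspeed': [6, 7, 2],
    'event': ['Rain', 'Sunny', 'Snow'],
})

indexed = df2.set_index('day')                   # new dataframe; df2 keeps its 'day' column
row = indexed.loc['1/1/2017']                    # whole row as a Series
temp = indexed.loc['1/1/2017', 'temperature']    # single value from that row
restored = indexed.reset_index()                 # 'day' becomes a normal column again

print(row)
print(temp)
print(restored)
```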

<h1>Different Ways of Creating Dataframe</h1>
<h3>Topics</h3>
<ol>
    <li>Using CSV</li>
    <li>Using Excel</li>
    <li>From python dictionary</li>
    <li>From list of tuples</li>
    <li>From list of dictionaries</li>
</ol>

In [28]:
df = pd.read_csv('weather_data_nyc_centralpark_2016(1).csv')
# To read from an Excel file use pd.read_excel('file_name', 'sheet_name')
# pd.DataFrame(dictionary) - keys are column names and values are the column values
days = ('1/1/2017', '1/2/2017', '1/3/2017')
temperature = (32, 35, 28)
windspeed = (6, 7, 2)
event = ('Rain', 'Sunny', 'Snow')
dataframe_data = list(zip(days, temperature, windspeed, event)) # Creates a list of tuples, one tuple per row
df = pd.DataFrame(dataframe_data, columns=['Day', 'Temperature', 'Windspeed', 'Event']) # Specify column names
df
# pd.DataFrame(data) can also create a dataframe from a list of dictionaries, where each dictionary holds the values
# for one row


        Day  Temperature  Windspeed  Event
0  1/1/2017           32          6   Rain
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
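
The list-of-dictionaries route is only mentioned in the comments above, so here is a minimal sketch of it next to the list-of-tuples version. The variable names are made up for the example.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
rows_as_dicts = [
    {'day': '1/1/2017', 'temperature': 32, 'windspeed': 6, 'event': 'Rain'},
    {'day': '1/2/2017', 'temperature': 35, 'windspeed': 7, 'event': 'Sunny'},
    {'day': '1/3/2017', 'temperature': 28, 'windspeed': 2, 'event': 'Snow'},
]
df_from_dicts = pd.DataFrame(rows_as_dicts)

# Each tuple is one row; the column names have to be passed separately
rows_as_tuples = [
    ('1/1/2017', 32, 6, 'Rain'),
    ('1/2/2017', 35, 7, 'Sunny'),
    ('1/3/2017', 28, 2, 'Snow'),
]
df_from_tuples = pd.DataFrame(rows_as_tuples, columns=['day', 'temperature', 'windspeed', 'event'])

print(df_from_dicts)
print(df_from_dicts.equals(df_from_tuples))  # both routes give the same dataframe
```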


<h1>Reading and writing CSV and Excel Files</h1>
<h3>Topics</h3>
<ul>
    <li>Read CSV</li>
    <li>Write CSV</li>
    <li>Read Excel</li>
    <li>Write Excel</li>
</ul>

In [44]:
# Options for reading from CSV
# df = pd.read_csv('weather_data_nyc_centralpark_2016(1).csv') - Default
# The problem with the default is that if the file starts with rows that are not part of the data, the dataframe is not imported correctly
df = pd.read_csv('weather_data_nyc_centralpark_2016(1).csv', skiprows=1) # Skips the first row because it does not contain useful data

# Let's say the CSV file has no header row
df2 = pd.read_csv('weather_data_nyc_centralpark_2016(2).csv') # By default pandas assumes that the first row is the header
df2 = pd.read_csv('weather_data_nyc_centralpark_2016(2).csv', header=None) # Tells pandas that there is no header; it generates integer column names (0, 1, 2, ...)
df2 = pd.read_csv('weather_data_nyc_centralpark_2016(2).csv', header=None, 
                   names=['date', 'maximum temperature', 'minimum temperature', 'average temperature', 'precipitation', 'snow fall', 'snow depth'])

# Let's say that the CSV represents missing values with certain strings or symbols
# In this case missing values are represented with NA, missing, 9000 and T
df3 = pd.read_csv('weather_data_nyc_centralpark_2016(2).csv', header=None, 
                names=['date', 'maximum temperature', 'minimum temperature', 'average temperature', 'precipitation', 'snow fall', 'snow depth'], 
                na_values={ # Allows you to specify missing values for specific columns
                    'snow depth': ['NA', 'missing', 9000],
                    'date': [42]
                })
df3


           date  maximum temperature  minimum temperature  average temperature  precipitation  snow fall  snow depth
0           NaN                   42                   34                 38.0           0.00        0.0         NaN
1      2-1-2016                   40                   32                 36.0           0.00        0.0         NaN
2      3-1-2016                   45                   35                 40.0           0.00        0.0           0
3      4-1-2016                   36                   14                 25.0           0.00        0.0         NaN
4      5-1-2016                   29                   11                 20.0           0.00        0.0           0
...         ...                  ...                  ...                  ...            ...        ...         ...
361  27-12-2016                   60                   40                 50.0              0          0           0
362  28-12-2016                   40                   34                 37.0              0          0           0
363  29-12-2016                   46                   33                 39.5           0.39          0           0
364  30-12-2016                   40                   33                 36.5           0.01          T           0

In [48]:
# Export dataframe as CSV
df3.to_csv('new.csv') # By default the index is also written to the CSV as an extra first column
df3.to_csv('new.csv', index=False) # index=False leaves the index out
# The columns argument is a list of the columns you want to export to the CSV
df3.to_csv('new.csv', index=False, header=False) # header=False leaves out the header names as well
# To convert certain values with a function while reading, use the converters argument of pd.read_csv / pd.read_excel

# To export a dataframe as an Excel file use df.to_excel('filename', sheet_name='sheet') # Parameters such as index, header and columns work as in to_csv
# Use pd.ExcelWriter('filename') to write multiple dataframes to the same Excel file on different sheets
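
The effect of index, header and columns is easy to see when to_csv is called without a file name, because it then returns the CSV text as a string. This is a sketch on a tiny made-up dataframe. The ExcelWriter part is left as comments because it needs an Excel engine such as openpyxl to be installed, and the file and sheet names there are placeholders.

```python
import pandas as pd

# Tiny made-up dataframe for the export examples
df = pd.DataFrame({
    'date': ['1-1-2016', '2-1-2016'],
    'maximum temperature': [42, 40],
    'minimum temperature': [34, 32],
})

with_index = df.to_csv()                       # index written as an unnamed first column
without_index = df.to_csv(index=False)         # index left out
selected = df.to_csv(index=False, columns=['date', 'maximum temperature'])  # only some columns
no_header = df.to_csv(index=False, header=False)  # data rows only

print(with_index)
print(without_index)
print(selected)
print(no_header)

# Writing two dataframes to different sheets of one Excel file (not run here):
# with pd.ExcelWriter('weather.xlsx') as writer:
#     df.to_excel(writer, sheet_name='all', index=False)
#     df[['date']].to_excel(writer, sheet_name='dates', index=False)
```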

<h1>Handle Missing Data</h1>
<h3>Topics</h3>
<ul>
    <li>fillna</li>
    <li>interpolate</li>
    <li>dropna</li>
</ul>
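
As a quick preview of the three methods listed above, here is a minimal sketch on a made-up dataframe with gaps. The values and names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data with missing temperature readings and events
df = pd.DataFrame({
    'temperature': [32.0, np.nan, 28.0, np.nan, 32.0],
    'event': ['Rain', None, 'Snow', None, 'Rain'],
})

# fillna: replace missing values, here with a different fill value per column
filled = df.fillna({'temperature': 0, 'event': 'no event'})

# interpolate: estimate missing numbers from their neighbours (linear by default)
interpolated = df['temperature'].interpolate()

# dropna: drop every row that has at least one missing value
dropped = df.dropna()

print(filled)
print(interpolated)
print(dropped)
```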