## DataFrame

So far we have studied the Series.

Here we turn to the DataFrame.

The DataFrame data structure is the heart of the pandas library.

It is the primary object that you'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional Series object, with an index and multiple columns of content, each column having a label. In fact, the distinction between a column and a row is only a conceptual one, and you can think of the DataFrame itself as simply a two-axis labeled array.
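
As a rough illustration of the two-axis idea, the sketch below builds a tiny frame and inspects the labels on each axis. The frame `demo` and its values are made up for this example and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the two labeled axes.
demo = pd.DataFrame(
    {"Name": ["Alice", "Jack"], "Score": [85, 82]},
    index=["school1", "school2"],
)

row_labels = list(demo.index)       # labels along axis 0 (rows)
column_labels = list(demo.columns)  # labels along axis 1 (columns)
shape = demo.shape                  # (number of rows, number of columns)

# Transposing swaps the two axes: the row labels become the column labels.
transposed_columns = list(demo.T.columns)

print(row_labels, column_labels, shape, transposed_columns)
```

Because both axes carry labels, swapping them with `.T` is cheap to express, which is one way to read the claim that rows and columns differ only conceptually.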

In [None]:
# Let's start by importing the pandas library
import pandas as pd

In [None]:
# The following code, up to and including the plot,
# is from https://www.geeksforgeeks.org/creating-a-dataframe-from-pandas-series/

In [None]:
#  Creating Dataframe from Series

author = ['Mehmet', 'Ali', 'Ayberk', 'Batuhan']
article = [210, 211, 114, 178]
  
auth_series    = pd.Series(author)
article_series = pd.Series(article)
article_series

In [None]:

frame = {'Author': auth_series, 'Article': article_series}
# This is a dictionary that maps column names to Series;
# the next cell converts it to a DataFrame.
frame

In [None]:

  
result = pd.DataFrame(frame)
result

In [None]:
# How to add a Series to an existing DataFrame

age = [17, 21, 24, 32]

result['Age'] = pd.Series(age)

result
# We created one more Series holding the authors' ages and assigned it
# directly as a new column of the DataFrame.
# Remember that the assignment aligns on the index: any row with no
# matching value is filled with NaN (a missing value) by default.

In [None]:
town = ["Ist", "Ank", "Trb"]
  
result['Town'] = pd.Series(town)
  
result
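
The `Town` cell above assigns a three-element Series to a four-row frame. A minimal sketch of what happens, assuming the default integer index on both objects (the name `people` is hypothetical and used only here):

```python
import pandas as pd

# Hypothetical four-row frame with the default integer index 0..3.
people = pd.DataFrame({"Author": ["Mehmet", "Ali", "Ayberk", "Batuhan"]})

# A three-element Series has index 0..2, so row 3 has no matching label
# and receives a missing value when the column is aligned on the index.
people["Town"] = pd.Series(["Ist", "Ank", "Trb"])

missing_mask = people["Town"].isna().tolist()
print(people)
print(missing_mask)
```

The alignment happens because a Series carries its own index. A plain Python list of the wrong length would raise a `ValueError` instead, since a list has no index to align on.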

In [None]:
import matplotlib.pyplot as plt
result.plot.bar()
plt.show()

In [None]:
# Let's create three school records for students and their class grades.
# I'll create each as a Series which has a student name, the class name, and the score.
record1 = pd.Series({    'Name': 'Alice',
                        'Class': 'Physics',
                        'Score': 85})

record2 = pd.Series({    'Name': 'Jack',
                        'Class': 'Chemistry',
                        'Score': 82})

record3 = pd.Series({   'Name': 'Helen',
                        'Class': 'Biology',
                        'Score': 90})

In [None]:
record3

In [None]:
record1['Name']

In [None]:
# Like a Series, the DataFrame object is indexed. Here I'll use a group of Series, where each Series
# represents a row of data. Just like the Series function, we can pass in our individual items
# in an array, and we can pass in our index values as a second argument
df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school1'])

# And just like the Series, we can use the head() function to see the first several rows of the
# DataFrame, including indices from both axes, and we can use this to verify the columns and the rows
df.head()

In [78]:
# An alternative is to use a list of dictionaries,
# where each dictionary represents a row of data.

students = [{'Name': 'Alice',
              'Class': 'Physics',
              'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]
#df = pd.DataFrame([record1, record2, record3],
#                  index=['school1', 'school2', 'school1'])

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

df.head()

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,Jack,82
school1,Biology,Helen,90


In [None]:
df['Class']  # the indexing operator selects a column by its label

In [None]:
df['school1']  # raises a KeyError: [] looks up column labels, so you cannot read rows this way
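
To make the failure explicit without stopping the notebook, the lookup can be wrapped in `try`/`except`. This is only a sketch: it rebuilds the same `df` from the list of dictionaries, and the flag `lookup_failed` is a name introduced here.

```python
import pandas as pd

students = [
    {"Name": "Alice", "Class": "Physics", "Score": 85},
    {"Name": "Jack", "Class": "Chemistry", "Score": 82},
    {"Name": "Helen", "Class": "Biology", "Score": 90},
]
df = pd.DataFrame(students, index=["school1", "school2", "school1"])

# [] with a single label looks the label up among the columns.
try:
    df["school1"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .loc looks the label up among the row index values instead.
rows = df.loc["school1"]

print(lookup_failed, len(rows))
```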

In [46]:
df.loc['school2']

Class    Chemistry
Name          Jack
Score           82
Name: school2, dtype: object

In [None]:
record2

In [None]:
# The name of the returned Series is the row's index label,
# while the column names become the index of that Series.

# We can check the data type of the result using the Python type function.
type(df.loc['school2'])

In [None]:
# school1 has multiple rows.
# Let's query for the school1 records
df.loc['school1']

In [None]:
# And we can see that the type of this result is different too
type(df.loc['school1'])

In [None]:
type(df.loc['school2'])
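
The return type of `.loc` therefore depends on how many rows carry the label. If a uniform type is needed, one option is to pass a list of labels, which yields a DataFrame even for a single match. A small sketch on the same data (the variable names are illustrative only):

```python
import pandas as pd

students = [
    {"Name": "Alice", "Class": "Physics", "Score": 85},
    {"Name": "Jack", "Class": "Chemistry", "Score": 82},
    {"Name": "Helen", "Class": "Biology", "Score": 90},
]
df = pd.DataFrame(students, index=["school1", "school2", "school1"])

one_match = df.loc["school2"]       # label occurs once  -> Series
two_matches = df.loc["school1"]     # label occurs twice -> DataFrame
forced_frame = df.loc[["school2"]]  # list of labels     -> DataFrame

print(type(one_match).__name__)
print(type(two_matches).__name__)
print(type(forced_frame).__name__)
```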

In [None]:
# You can supply two parameters to .loc, one being the row index and the other being the column name.
# For instance, if we are only interested in the names of school1's students
df.loc['school1', 'Name']

In [None]:
# we can use transpose
df.T # transpose

In [None]:
df.T['school1'] # recall that df['school1'] is not possible

In [None]:
# To get the student names only, we need .loc on the transpose.
df.T['Name']  # this raises a KeyError, because 'Name' is now a row label

In [37]:
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [None]:
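# The indexing operator is reserved for column selection, while .loc selects rows. In practice,
# this works really well since you're often trying to add or drop columns. However, it also
# means that you get a KeyError if you use .loc with a column name, as here, where 'school1'
# is a column of the transpose
df.T.loc['school1']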


In [None]:
df

In [38]:
# Note too that the result of a single column projection is a Series object
type(df['Name'])

pandas.core.series.Series

In [43]:
# CHAINING:
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [41]:
# Since the result of using the indexing operator is either a DataFrame or a Series, you can chain
# operations together. For instance, we can select all of the rows which relate to school1 using
# .loc, then project the Name column from just those rows
df.loc['school1']['Name']
# ==> First find the rows for school1, then pick the 'Name' column from them.

school1    Alice
school1    Helen
Name: Name, dtype: object

In [44]:
df.loc['school1', 'Name'] 
# remember we also did the same thing before. 

school1    Alice
school1    Helen
Name: Name, dtype: object

In [None]:
### Then what is the difference?



In [None]:
# Chaining, that is, indexing on the return value of another indexing operation, can come with
# some costs and is best avoided if you can use another approach.
# In particular, chaining tends to cause pandas
# to return a copy of the DataFrame instead of a view on the DataFrame.
# For selecting data, this is not a big deal, though it might be slower than necessary.
# If you are changing data, though, this is an important distinction and can be a source of error.

In [45]:
df.loc['school2']['Name'] = 'mehmet'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [47]:
df

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,Jack,82
school1,Biology,Helen,90


In [48]:
df.loc['school2','Name'] = 'mehmet'

In [49]:
df

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,mehmet,82
school1,Biology,Helen,90
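
The two assignments can be compared side by side. The sketch below rebuilds the frame so that it runs on its own, and it silences the warning only to keep the output short. The comment about the temporary object describes the behavior for a frame with mixed column types like this one; other frames and pandas versions may behave differently, which is the reason the chained form is unreliable.

```python
import warnings

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Jack", "Helen"],
        "Class": ["Physics", "Chemistry", "Biology"],
        "Score": [85, 82, 90],
    },
    index=["school1", "school2", "school1"],
)

# Chained form: df.loc["school2"] builds a new Series first, and the
# assignment then writes into that temporary object, not into df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    df.loc["school2"]["Name"] = "mehmet"
name_after_chained = df.loc["school2", "Name"]

# Single-step form: one .loc call addresses the row and the column together.
df.loc["school2", "Name"] = "mehmet"
name_after_loc = df.loc["school2", "Name"]

print(name_after_chained, name_after_loc)
```

Passing both labels to a single `.loc` call leaves no intermediate object, so there is no question of whether the write reaches the original frame.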


In [50]:

print(type(df.loc['school2']['Name']))
print(type(df.loc['school1']['Name']))
print(type(df.loc['school2']))
print(type(df.loc['school1'])) # the one with two lines

<class 'str'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [54]:
df.loc['school1']

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school1,Biology,Helen,90


In [55]:
df.loc['school1'] = "a"
df

Unnamed: 0,Class,Name,Score
school1,a,a,a
school2,Chemistry,mehmet,82
school1,a,a,a


In [None]:
# SLICING
# If you need more than one column...


In [58]:
# Here's an example, where we ask for all the names and scores for all schools using the 
# .loc operator.
df

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,Jack,82
school1,Biology,Helen,90


In [59]:
df.loc['school1',['Name', 'Score']]

Unnamed: 0,Name,Score
school1,Alice,85
school1,Helen,90


In [60]:
df.loc[: , ['Name', 'Score']]


Unnamed: 0,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90


In [61]:
df.loc["school1", :]


Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school1,Biology,Helen,90


In [None]:
# Take a look at that again. The colon means that we want to get all of the rows, and the list
# in the second argument position is the list of columns we want to get back

In [None]:
# That's selecting and projecting data from a DataFrame based on row and column labels. The key
# concepts to remember are that the rows and columns are really just for our benefit. Underneath,
# this is just a two-axis labeled array, and transposing it is easy. Also, consider the
# issue of chaining carefully, and try to avoid it, as it can cause unpredictable results, where
# your intent was to obtain a view of the data, but instead pandas returns a copy to you.

In [62]:
df

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,Jack,82
school1,Biology,Helen,90


In [63]:
# Before we leave the discussion of accessing data in DataFrames, let's talk about dropping data.
# It's easy to delete data in Series and DataFrames, and we can use the drop function to do so.
# In its simplest form, this function takes a single parameter, which is the row label to drop.
# This is another tricky place for new users: the drop function doesn't change the DataFrame by default!
# Instead, the drop function returns to you a copy of the DataFrame with the given rows removed.

df.drop('school1')  # gives a copy of df with the given rows removed

Unnamed: 0,Class,Name,Score
school2,Chemistry,Jack,82


In [64]:
# But if we look at our original DataFrame we see the data is still intact.
df

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,Jack,82
school1,Biology,Helen,90


In [None]:
# Drop has two interesting optional parameters.
# 1) inplace
# If it's set to True, the DataFrame will be updated in place, instead of a copy being returned.
# 2) axis
# The second parameter is the axis along which to drop.
# 0 ==> row (the default), 1 ==> column.
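
A short sketch of the `axis` parameter on the lecture data, rebuilt here so the example is self-contained. The `columns=` keyword in the last call is an alternative spelling that pandas also accepts; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Jack", "Helen"],
        "Class": ["Physics", "Chemistry", "Biology"],
        "Score": [85, 82, 90],
    },
    index=["school1", "school2", "school1"],
)

without_row = df.drop("school2")               # axis=0 is the default: drop a row
without_col = df.drop("Class", axis=1)         # axis=1: drop a column
same_without_col = df.drop(columns=["Class"])  # keyword form of the same operation

# None of the calls used inplace=True, so df itself still has every row and column.
print(list(without_row.index))
print(list(without_col.columns))
print(df.shape)
```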

In [68]:
# For example, let's make a copy of a DataFrame using .copy()
copy_df  = df.copy()  # an independent copy
copy_df2 = df         # not a copy: just a second name for the same DataFrame
copy_df



Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school2,Chemistry,Jack,82
school1,Biology,Helen,90


In [None]:
# Why did we use the copy function here instead of simply assigning copy_df = df?

# We will come back to this later...

In [72]:
# Now let's drop the school2 row in this copy
copy_df.drop("school2", inplace=True, axis=0)
copy_df

Unnamed: 0,Class,Name,Score
school1,Physics,Alice,85
school1,Biology,Helen,90


In [73]:
# There is a second way to drop a column, and that's directly through the use of the indexing
# operator, using the del keyword. This way of dropping data, however, takes immediate effect
# on the DataFrame and does not return a new object.
del copy_df['Class']
copy_df

Unnamed: 0,Name,Score
school1,Alice,85
school1,Helen,90


In [74]:
# Finally, adding a new column to the DataFrame is as easy as assigning it to some value using
# the indexing operator. For instance, if we wanted to add a class ranking column with default 
# value of None, we could do so by using the assignment operator after the square brackets.
# This broadcasts the default value to the new column immediately.

df['ClassRanking'] = None
df

Unnamed: 0,Class,Name,Score,ClassRanking
school1,Physics,Alice,85,
school2,Chemistry,Jack,82,
school1,Biology,Helen,90,
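
The broadcast applies to scalars only. A list assigned to a column has to match the number of rows, as this minimal sketch on a rebuilt frame suggests (the flag `length_error` is a name introduced here):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school1"],
)

# A scalar is broadcast to every row of the new column.
df["ClassRanking"] = None
all_missing = bool(df["ClassRanking"].isna().all())

# A list is assigned positionally, one value per row.
df["ClassRanking"] = [22, 21, 20]
rankings = df["ClassRanking"].tolist()

# A list of the wrong length is rejected instead of being padded.
try:
    df["ClassRanking"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(all_missing, rankings, length_error)
```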


In this lecture you've learned about the data structure you'll use the most in pandas, the DataFrame. The 
DataFrame is indexed both by row and column, and you can easily select individual rows and project the columns 
you're interested in using the familiar indexing methods from the Series class. You'll be gaining a lot of 
experience with the DataFrame in the content to come.

In [79]:
df['ClassRanking'] = [22,21,20]
df

Unnamed: 0,Class,Name,Score,ClassRanking
school1,Physics,Alice,85,22
school2,Chemistry,Jack,82,21
school1,Biology,Helen,90,20


In [76]:
df.loc['school1', 'Score'] = 23  # sets Score for every row labeled school1
df


Unnamed: 0,Class,Name,Score,ClassRanking
school1,Physics,Alice,23,22
school2,Chemistry,Jack,82,21
school1,Biology,Helen,23,20


In [None]:
## Now let's check why we used the copy function...

In [81]:
df2 = df  # df2 is just another name for df, not a copy
df2

Unnamed: 0,Class,Name,Score,ClassRanking
school1,Physics,Alice,85,22
school2,Chemistry,Jack,82,21
school1,Biology,Helen,90,20


In [82]:
df2.loc['school2'] = 1
df2

Unnamed: 0,Class,Name,Score,ClassRanking
school1,Physics,Alice,85,22
school2,1,1,1,1
school1,Biology,Helen,90,20


In [83]:
df  # df has changed too, because df2 and df refer to the same object

Unnamed: 0,Class,Name,Score,ClassRanking
school1,Physics,Alice,85,22
school2,1,1,1,1
school1,Biology,Helen,90,20
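
The difference between plain assignment and `.copy()` can be checked directly. In this sketch the names `alias` and `independent` are hypothetical; the frame is rebuilt so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school1"],
)

alias = df               # a second name bound to the same DataFrame object
independent = df.copy()  # a separate object with its own data

# Writing through the alias changes df, because they are one object.
alias.loc["school2", "Score"] = 1

same_object = alias is df
score_in_df = df.loc["school2", "Score"]
score_in_copy = independent.loc["school2", "Score"]

print(same_object, score_in_df, score_in_copy)
```

Python assignment binds a name to an existing object, so only `.copy()` protects the original from later edits.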
