## DataFrame

- A DataFrame is conceptually a two-dimensional Series: it has an index and multiple columns of content, and each column has a label.
- The distinction between a row and a column is largely conceptual.
- You can think of the DataFrame as a labeled array with two axes.
- It is much like an Excel table in Python.
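
To make the "two-axis labeled array" idea concrete, here is a minimal sketch. The table is made up for illustration; it only shows that rows and columns each carry their own labels.

```python
import pandas as pd

# A tiny, made-up table: two labeled rows and two labeled columns
toy = pd.DataFrame({"Name": ["Alice", "Jack"], "Score": [85, 82]},
                   index=["school1", "school2"])

print(toy.index.tolist())    # row labels
print(toy.columns.tolist())  # column labels
print(toy.shape)             # (number of rows, number of columns)
```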


In [None]:
# Generating Dataframe from dictionaries and series

In [None]:
# Let's start by importing our pandas library
import pandas as pd

####  Creating Dataframe - 1

In [None]:


author = ['Mehmet', 'Ali', 'Ayberk', 'Batuhan']
article = [210, 211, 114, 178]
  
auth_series    = pd.Series(author)
article_series = pd.Series(article)
article_series

In [None]:

frame = { 'Author': auth_series, 'Article': article_series }
# frame is still a plain dictionary that maps column names to Series;
# the next cell converts it to a DataFrame
frame

In [None]:

  
result = pd.DataFrame(frame)
result

#### How to add a Series to an existing DataFrame


In [None]:

age = [17, 21, 24, 32]

# Add one more Series, the authors' ages, directly to the DataFrame as a new column.
# Values are matched by index label, so any row without a matching value
# is filled with NaN (pandas' missing-value marker).
result['Age'] = pd.Series(age)

result

In [None]:
town = ["Ist", "Ank", "Trb"]
  
result['Town'] = pd.Series(town)
  
result
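
The NaN filling seen above comes from index alignment: when a Series is assigned as a column, pandas matches values by index label rather than by position. A minimal sketch with made-up data (the `people` frame is illustrative only):

```python
import pandas as pd

people = pd.DataFrame({"Author": ["Mehmet", "Ali", "Ayberk"]})  # index labels 0, 1, 2

# This Series only has the labels 2 and 0; label 1 is missing
people["Town"] = pd.Series(["Trb", "Ist"], index=[2, 0])

# Row 0 receives "Ist", row 2 receives "Trb", and row 1 is left as NaN
print(people)
```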

In [None]:
import matplotlib.pyplot as plt
result.plot.bar()
plt.show()

#### Creating Dataframe - 2

In [None]:
# Let's create three school records for students and their class grades.
# Each record is a Series that holds a student name, the class name, and the score.
record1 = pd.Series({    'Name': 'Alice',
                        'Class': 'Physics',
                        'Score': 85})

record2 = pd.Series({    'Name': 'Jack',
                        'Class': 'Chemistry',
                        'Score': 82})

record3 = pd.Series({   'Name': 'Helen',
                        'Class': 'Biology',
                        'Score': 90})

In [None]:
record3

In [None]:
record1['Name']

In [None]:
# Like a Series, the DataFrame object is indexed. Here I'll use a list of Series, where each
# Series represents a row of data. Just like the Series constructor, we can pass in our individual
# items in a list, and we can pass in our index values as a second argument.
df = pd.DataFrame([record1, record2, record3], index=['school1', 'school2', 'school1'])

# And just like the Series we can use the head() function to see the first several rows of the
# DataFrame, including labels from both axes, and we can use this to verify the columns and the rows
df.head()

In [None]:
# An alternative method is that you could use --a list of dictionaries--, 
# where each dictionary represents a row of data.

students = [{'Name': 'Alice',
              'Class': 'Physics',
              'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]
#df = pd.DataFrame([record1, record2, record3],
#                  index=['school1', 'school2', 'school1'])

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

df.head()

In [None]:
df['Class'] # you can read columns 

In [None]:
df['school2'] # raises KeyError: [] looks up column labels, so it cannot select a row

In [None]:
df.loc['school2']

In [None]:
# school1 has multiple rows.
# Let's query for the school1 records
df.loc['school1']

In [None]:
df.loc['school1', 'Name']

#### Transpose

In [None]:
# we can use transpose
df.T # transpose

In [None]:
df.T['school1'] # recall that df['school1'] is not possible

In [None]:
# After transposing, 'Name' is a row label, so plain [] indexing raises a KeyError; use .loc instead
df.T['Name']

In [None]:
df.T.loc['Name']

In [None]:
df

### Chaining

In [None]:

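df.loc['school1']['Name'] # this is called chaining: two separate indexing operations
# Chaining can return a copy rather than a view of the DataFrame. For reading that is fine,
# but if you are changing data this distinction matters and can be a source of errors.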

In [None]:
df.loc['school1', 'Name'] 
# remember we also did the same thing before. 

In [None]:
# Chained assignment writes to a temporary copy of the row, so df itself is not updated
df.loc['school2']['Name'] = 'mehmet'

In [None]:
df

In [None]:
df.loc['school2','Name'] = 'mehmet'

In [None]:
df
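
The difference between the chained assignment and the single `.loc` assignment above can be checked on a small frame. This is a sketch with made-up data; `df_demo` is an illustrative name, and the warning filter is only there because pandas may warn about the chained write.

```python
import warnings
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Jack"], "Score": [85, 82]},
                       index=["school1", "school2"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about chained assignment
    # Chained indexing: .loc returns a temporary row, and the write goes to that row
    df_demo.loc["school2"]["Name"] = "mehmet"
after_chained = df_demo.loc["school2", "Name"]  # df_demo is unchanged

# A single .loc call with row and column labels writes into df_demo itself
df_demo.loc["school2", "Name"] = "mehmet"
after_loc = df_demo.loc["school2", "Name"]

print(after_chained, after_loc)  # Jack mehmet
```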

In [None]:
df.loc['school1']

In [None]:
# A scalar is broadcast to every column of every row labeled school1
df.loc['school1'] = "a"
df

####  SLICING
If you need more than one column...


In [None]:
# The next cells ask for the names and scores of selected schools, and then of all schools,
# using the .loc operator. First, the current DataFrame:
df

In [None]:
df.loc['school1',['Name', 'Score']]

In [None]:
df.loc[: , ['Name', 'Score']]


In [None]:
df.loc["school1", :]


In [None]:

df.drop('school1') # returns a copy of df with the given rows removed

In [None]:
# But if we look at our original DataFrame we see the data is still intact.
df

In [None]:
# Drop has two interesting optional parameters.
# 1) inplace
# If it's set to True, the DataFrame is updated in place instead of a copy being returned.
# 2) axis
# The axis to drop from.
# 0 ==> row, 1 ==> column.
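
A small sketch of both parameters on made-up data (the `grades` frame is illustrative only): without `inplace`, `drop` hands back a new DataFrame; with `inplace=True`, it changes the frame and returns `None`.

```python
import pandas as pd

grades = pd.DataFrame({"Name": ["Alice", "Jack", "Helen"],
                       "Score": [85, 82, 90]},
                      index=["school1", "school2", "school1"])

rows_dropped = grades.drop("school1")        # axis=0 (default): drop rows by label
cols_dropped = grades.drop("Score", axis=1)  # axis=1: drop a column
print(len(grades), len(rows_dropped))        # the original is untouched

returned = grades.drop("school2", inplace=True)  # modifies grades itself
print(returned, len(grades))                     # None 2
```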

In [None]:
# For example, lets make a copy of a DataFrame using .copy()
copy_df  = df.copy()
copy_df2 = df
copy_df



In [None]:
# Why did we call .copy() instead of simply assigning, as in copy_df2 = df?
# Plain assignment only creates a second name for the same DataFrame; see the Aliasing section later.

In [None]:
# Now let's drop the school2 row from this copy
copy_df.drop("school2", inplace=True, axis=0)
copy_df

In [None]:
# Finally, adding a new column to the DataFrame is as easy as assigning it to some value using
# the indexing operator. For instance, if we wanted to add a class ranking column with default 
# value of None, we could do so by using the assignment operator after the square brackets.
# This broadcasts the default value to the new column immediately.

df['ClassRanking'] = None
df

In this lecture you've learned about the data structure you'll use the most in pandas, the DataFrame. The 
dataframe is indexed both by row and column, and you can easily select individual rows and project the columns 
you're interested in using the familiar indexing methods from the Series class. You'll be gaining a lot of 
experience with the DataFrame in the content to come.

In [None]:
df['ClassRanking'] = [22,21,20]
df

In [None]:
df.loc['school1', "Score"] =23
df


### ALIASING

In [None]:
df2 = df
df2

In [None]:
df2.loc['school2'] = 1
df2

In [None]:
df
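
The same effect in isolation, as a minimal sketch with made-up data (`original`, `alias`, and `snapshot` are illustrative names): plain assignment gives a second name for one object, while `.copy()` gives independent data.

```python
import pandas as pd

original = pd.DataFrame({"Score": [85, 82]}, index=["school1", "school2"])

alias = original            # a second name for the same object
snapshot = original.copy()  # an independent copy of the data

alias.loc["school2", "Score"] = 1

print(alias is original)                  # True: same object
print(original.loc["school2", "Score"])   # 1: the change shows through both names
print(snapshot.loc["school2", "Score"])   # 82: the copy is unaffected
```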

## DATAFRAME: INDEXING AND LOADING

In [None]:
import pandas as pd

# Pandas makes it easy to turn a CSV into a DataFrame: we just call read_csv()
df = pd.read_csv('datasets/Admission_Predict.csv',delimiter=";", index_col=0)

df.columns

In [None]:
X = df.drop(columns=['Chance of Admit '])
# or, once the column names have been lower-cased and stripped (see the Masking section below):
#X = df[['gre score', 'toefl score', 'university rating', 'sop', 'lor', 'cgpa','research']]
y = df['Chance of Admit ']

In [None]:
X

### MASKING

In [None]:
# Normalize the column names: lower-case them and strip stray whitespace
df.columns = [x.lower().strip() for x in df.columns]

# And let's look at the first few rows
df.head()

In [None]:
admit_mask = df['chance of admit'] > 0.7 # we can use this masking to pick the students
#whose chance of admit is larger than 0.7
admit_mask

In [None]:
df.where(df['chance of admit'] > 0.7).dropna()

In [None]:
# A shorthand way of writing the above is
df[ df['chance of admit'] > 0.7 ]  # returns a new DataFrame containing only the matching rows


In [None]:
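# Combine conditions with & (element-wise AND); each comparison needs its own parentheses
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)
# The same mask written with comparison methods
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)
# Careful: this applies .lt(0.9) to the boolean output of .gt(0.7), so it is not a range test
df['chance of admit'].gt(0.7).lt(0.9)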
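
The three expressions above are easier to compare on a tiny Series. This is a sketch with made-up values; it shows that the two `&` forms agree, while the chained `.gt(0.7).lt(0.9)` compares the booleans themselves with 0.9 and gives a different mask.

```python
import pandas as pd

chance = pd.Series([0.65, 0.80, 0.95], name="chance of admit")  # made-up values

both = (chance > 0.7) & (chance < 0.9)   # element-wise AND of two masks
same = chance.gt(0.7) & chance.lt(0.9)   # method form of the same mask
chained = chance.gt(0.7).lt(0.9)         # tests True/False against 0.9, not the values

print(both.tolist())     # [False, True, False]
print(chained.tolist())  # [True, False, False]
print(chance[both].tolist())
```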

### DataFrame Indexing - Continued

In [None]:
df = pd.read_csv("datasets/Admission_Predict.csv", index_col=0, delimiter=";")
df.head(11)

In [None]:
df['SOP'] = df.index # copies the index into a column; note that this overwrites the existing SOP values
df

In [None]:
df = df.set_index('CGPA')
df.head()

In [None]:
df = df.reset_index()
df.head()
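
What `set_index` and `reset_index` do to the columns can be seen on a small frame. A minimal sketch, assuming made-up values and the illustrative name `apps`:

```python
import pandas as pd

apps = pd.DataFrame({"Serial No.": [1, 2, 3],
                     "CGPA": [9.65, 8.87, 8.00]})  # made-up values

by_cgpa = apps.set_index("CGPA")   # "CGPA" becomes the row index and leaves the columns
restored = by_cgpa.reset_index()   # the index moves back to a column; rows get a fresh 0..n-1 index

print(by_cgpa.columns.tolist())    # ['Serial No.']
print(restored.columns.tolist())   # ['CGPA', 'Serial No.']
```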

In [None]:
df = pd.read_csv('datasets/mpg.csv',index_col=0)
df.head()

In [None]:
df['manufacturer'].unique()

In [None]:

df2 = df[df['manufacturer'] == "audi"]
df2

### Multi-indexing

In [None]:
df = df.set_index(['manufacturer', 'model'])
df

In [None]:
df.loc[("audi", "a4")] # a (manufacturer, model) tuple selects on both index levels without ambiguity

In [None]:
df.loc["audi"]
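
A self-contained sketch of the same two lookups, using a made-up three-row table (the `cars` frame and its values are illustrative, not taken from the mpg file):

```python
import pandas as pd

cars = pd.DataFrame({"manufacturer": ["audi", "audi", "ford"],
                     "model": ["a4", "a6", "mustang"],
                     "cty": [18, 15, 16]})  # made-up values
cars = cars.set_index(["manufacturer", "model"]).sort_index()

audi_rows = cars.loc["audi"]              # all audi rows, indexed by the remaining level (model)
a4_cty = cars.loc[("audi", "a4"), "cty"]  # one cell: full (manufacturer, model) key plus a column

print(audi_rows)
print(a4_cty)
```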

In [None]:
# Filling the NaN values in DataFrames

df.fillna(0, inplace=True) # if inplace is True, the original df is modified.


### Applying functions to dataframes


In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame([[9, 16], [4,25], [100,64]])
df

In [None]:
df.apply(np.sqrt)


In [None]:
df.apply(np.sum, axis=0) # axis=0 applies the function down each column: one sum per column


In [None]:
df.apply(np.sum, axis=1) # axis=1 applies the function across each row: one sum per row



## Merging Dataframes

In [None]:
import pandas as pd

# First we create two DataFrames, staff and students.
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liaison'},
                         {'Name': 'James', 'Role': 'Grader'}])
# And let's index these staff by name
staff_df = staff_df.set_index('Name')
# Now we'll create a student dataframe
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
# And we'll index this by name too
student_df = student_df.set_index('Name')

# And let's just print out the dataframes
print(staff_df.head())
print(student_df.head())



In [None]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True) # outer = union of the names in both frames


In [None]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True) # inner = intersection: only names present in both frames


In [None]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True) # left = keep every row of staff_df



In [None]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True) # right = keep every row of student_df


## GroupBy

In [None]:

df = pd.read_csv('datasets/mpg.csv',index_col=0)


#### Groupby - Aggregation

In [None]:

df.groupby("manufacturer").agg({"cty":np.average}) # np.average does not skip NaN: a group that contains a NaN gets NaN as its result
df.groupby("manufacturer").agg({"cty":np.nanmean}) # ignores NaN values: with 3 values and 2 NaNs, the sum is divided by 3
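
The NaN behaviour is easy to verify on a toy frame. This is a sketch with made-up values (`mpg_toy` is an illustrative name); lambdas are used here so that the NumPy functions are called directly on each group's raw values.

```python
import numpy as np
import pandas as pd

mpg_toy = pd.DataFrame({"manufacturer": ["audi", "audi", "audi", "ford", "ford"],
                        "cty": [18.0, np.nan, 20.0, 15.0, 17.0]})  # made-up values

grouped = mpg_toy.groupby("manufacturer")["cty"]

# Plain NumPy average on the raw values: any NaN makes the result NaN
plain = grouped.agg(lambda s: np.average(s.to_numpy()))
# NaN-aware mean: missing values are skipped
nan_aware = grouped.agg(lambda s: np.nanmean(s.to_numpy()))

print(plain)      # audi is NaN, ford is 16.0
print(nan_aware)  # audi is 19.0, ford is 16.0
```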

#### Groupby - Transformation

In [None]:

cols=['manufacturer','cty','hwy','trans']
df[cols]


In [None]:
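# 'trans' is assumed to hold text (as in the standard mpg data), so average only the numeric columns
df[cols].groupby('manufacturer')[['cty','hwy']].transform(np.nanmean)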


#### Groupby - Filtering

In [None]:
df.groupby('manufacturer').filter(lambda x: np.nanmean(x['cty'])>20)
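
To see how transformation and filtering differ, here is a minimal sketch on made-up data (`mpg_small` and its values are illustrative): `transform` returns one value per original row, while `filter` keeps or drops whole groups.

```python
import pandas as pd

mpg_small = pd.DataFrame({"manufacturer": ["audi", "audi", "honda", "honda"],
                          "cty": [18, 20, 24, 28]})  # made-up values

# transform: every row receives its own group's mean, so the row count is unchanged
mpg_small["cty_group_mean"] = mpg_small.groupby("manufacturer")["cty"].transform("mean")

# filter: keep only the groups whose mean cty is above 20
efficient = mpg_small.groupby("manufacturer").filter(lambda g: g["cty"].mean() > 20)

print(mpg_small)
print(efficient)
```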

#### Groupby - Applying


In [None]:
df=pd.read_csv("datasets/mpg.csv")
# And let's just include some of the columns we were interested in previously
df=df[['manufacturer','cty']]
df.head(33)

In [None]:
def calc_mean_cty(group):
    # group is a DataFrame holding the rows of one group, e.g. one manufacturer,
    # so we can treat it like a complete DataFrame
    avg = np.nanmean(group["cty"])
    # work on a copy, then broadcast our formula to create a new column
    group = group.copy()
    group["cty_diff"] = np.abs(avg - group["cty"])
    return group

# apply the function to each manufacturer's group
df.groupby("manufacturer").apply(calc_mean_cty)

## PIVOT TABLE


In [None]:
df=pd.read_csv("datasets/mpg.csv", index_col=0)
df.pivot_table(values='cty', index='class', columns='cyl', aggfunc=[np.mean, np.max,np.min])
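
What the pivot table computes is easier to see on a handful of rows. A minimal sketch with made-up values (the `cars` frame is illustrative, and a single aggregation is used to keep the output small): each cell holds the mean `cty` for one class and cylinder combination, and combinations with no rows are NaN.

```python
import pandas as pd

cars = pd.DataFrame({"class": ["compact", "compact", "suv", "suv"],
                     "cyl":   [4, 6, 6, 6],
                     "cty":   [21, 18, 14, 16]})  # made-up values

# rows = class, columns = cyl, cell = mean of cty for that combination
table = cars.pivot_table(values="cty", index="class", columns="cyl", aggfunc="mean")
print(table)
```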

