# DataFrame and Series Basics - Selecting Rows and Columns

In [1]:
import pandas as pd

We can create a DataFrame from a dictionary of lists, where:
- dictionary keys = columns
- list positions = rows (all lists must be the same length)

In [25]:
people = {
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz', 'Shaheen'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com']
}

In [26]:
df = pd.DataFrame(people)
df

|   | name | last | email |
|---|------|------|-------|
| 0 | Syed | Atif | abc@abc.com |
| 1 | Saquib | Faraz | JaneDoe@Yahoo.com |
| 2 | Saeed | Shaheen | Saeed@hotmail.com |


In [27]:
type(df)

pandas.core.frame.DataFrame
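
The same-length requirement for the lists can be checked with a small sketch. The `ragged` dictionary below is hypothetical data made up for illustration; pandas should refuse to build a DataFrame from it:

```python
import pandas as pd

# Hypothetical data: 'last' has only two entries while 'name' has three.
ragged = {
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz'],
}

try:
    pd.DataFrame(ragged)
    raised = False
except ValueError as err:
    # pandas rejects columns of unequal length
    raised = True
    print(f"ValueError: {err}")
```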

## Fetching data from Columns:
When we access a single row or column of a DataFrame, we get a Series object.

In [28]:
df['email']  # Dot notation, i.e. df.email, returns the same Series object

0          abc@abc.com
1    JaneDoe@Yahoo.com
2    Saeed@hotmail.com
Name: email, dtype: object

In [29]:
type(df['email'])

pandas.core.series.Series

A DataFrame is a two-dimensional data structure because it has rows and columns. It's like a container for multiple Series objects.

A Series is like a one-dimensional array (or a list of data) but, like a DataFrame, it has many methods for processing its data. In other words, it holds the rows of a single column.
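
A minimal sketch of this relationship, rebuilding a reduced version of the `people` data so the snippet runs on its own. `ndim` reports the number of dimensions, and `.str.lower()` is one example of the Series methods mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Syed', 'Saquib', 'Saeed'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com'],
})

print(df.ndim)           # number of dimensions of the DataFrame
print(df['email'].ndim)  # number of dimensions of one column

# Every column pulled out of the DataFrame is a Series.
all_series = all(isinstance(df[col], pd.Series) for col in df.columns)

# Series come with their own processing methods, e.g. string helpers.
lowered = df['email'].str.lower()
```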

> **bracket vs dot notation**
> Bracket notation is preferred because if a column name is the same as a DataFrame attribute or method, dot notation returns that attribute instead of the column. For example, if your df has a column called count, df['count'] returns the column as a Series, whereas df.count returns the DataFrame's count() method. Otherwise, which one you use is a personal preference, and neither is wrong.
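
The name clash can be reproduced with a small sketch. The `scores` frame and its values are hypothetical, chosen only because one column is named `count`:

```python
import pandas as pd

# Hypothetical frame with a column named 'count', which clashes with DataFrame.count().
scores = pd.DataFrame({'name': ['a', 'b'], 'count': [3, 5]})

col = scores['count']   # bracket notation: the column, as a Series
attr = scores.count     # dot notation: the bound DataFrame.count method

print(type(col).__name__)
print(callable(attr))
```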

## Fetching data from multiple columns

In [30]:
df[['last', 'email']]

|   | last | email |
|---|------|-------|
| 0 | Atif | abc@abc.com |
| 1 | Faraz | JaneDoe@Yahoo.com |
| 2 | Shaheen | Saeed@hotmail.com |


We can pass in a list of columns to get a df containing those columns. Note that we are passing a list. If we pass the names directly into the DataFrame's square brackets, pandas treats the pair of strings as a single column key (a tuple) and raises a KeyError, e.g.

In [31]:
# df['last', 'email']  # KeyError
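
A runnable sketch of that failure, catching the error so the snippet completes. It rebuilds the `people` DataFrame so it can be run on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz', 'Shaheen'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com'],
})

try:
    df['last', 'email']   # looked up as the single key ('last', 'email')
    raised = False
except KeyError as err:
    raised = True
    print(f"KeyError: {err}")

# Wrapping the names in a list selects both columns.
subset = df[['last', 'email']]
```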

> Note: When we fetch multiple columns, the result is no longer a series because a series represents a single column. Notice the datatype returned using a single column vs multi column fetch:

In [32]:
type(df['last'])

pandas.core.series.Series

In [33]:
type(df[['last']])

pandas.core.frame.DataFrame

## Getting list of columns in a dataframe

In [34]:
df.columns

Index(['name', 'last', 'email'], dtype='object')

## Fetching data from rows:
We use **iloc** and **loc** to fetch data from rows.

### iloc
iloc (integer location) fetches rows by their integer position in the index.
>Note: In the Series returned for a single row, the index is made up of the column names.

In [54]:
df.iloc[0] # returns a series of the first row of data

name            Syed
last            Atif
email    abc@abc.com
Name: 0, dtype: object
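
A short sketch that inspects the row returned by iloc, rebuilding the `people` DataFrame so it runs on its own. The row's index holds the column names, and its `name` is the row's label in the original index:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz', 'Shaheen'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com'],
})

row = df.iloc[0]

print(row.index.tolist())   # the column names of df
print(row.name)             # the index label of the selected row
print(row['email'])         # values are looked up by column name
```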

#### Fetching data from multiple rows:
Just as with multiple columns, we can access multiple rows by passing a list of integer positions:

In [55]:
df.iloc[[0,2]] # returns a DataFrame because a list of rows is passed

|   | name | last | email |
|---|------|------|-------|
| 0 | Syed | Atif | abc@abc.com |
| 2 | Saeed | Shaheen | Saeed@hotmail.com |


#### Fetching data using both rows and columns

iloc also accepts a list of row positions followed by a list of column positions, separated by a comma.
> Note: The i in iloc stands for integer. This is why it accepts integer column positions (instead of the columns' string labels).

In [62]:
df.iloc[[0,1], 2] # series

0          abc@abc.com
1    JaneDoe@Yahoo.com
Name: email, dtype: object

In [63]:
df.iloc[[0,1], [0,2]] # dataframe

|   | name | email |
|---|------|-------|
| 0 | Syed | abc@abc.com |
| 1 | Saquib | JaneDoe@Yahoo.com |


In [64]:
df.iloc[0, [0,2]] # Series

name            Syed
email    abc@abc.com
Name: 0, dtype: object

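>Note: If either the row or the column selector is a single integer rather than a list, iloc returns a Series instead of a DataFrame. If both are single integers, it returns the single value itself.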
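
The combinations can be compared side by side. This sketch rebuilds the `people` DataFrame so it runs on its own, and adds two cases not shown in the cells above (two single integers, and two one-element lists):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz', 'Shaheen'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com'],
})

list_list = df.iloc[[0, 1], [0, 2]]   # list, list     -> DataFrame
list_int = df.iloc[[0, 1], 2]         # list, integer  -> Series
int_list = df.iloc[0, [0, 2]]         # integer, list  -> Series
int_int = df.iloc[0, 2]               # integer, integer -> single value
one_by_one = df.iloc[[0], [2]]        # one-element lists still give a DataFrame

for result in (list_list, list_int, int_list, int_int, one_by_one):
    print(type(result).__name__)
```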

### loc
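loc selects rows by index label rather than by integer position. With the default index the labels are 0, 1, 2, ..., so for fetching rows loc looks the same as iloc here.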

In [65]:
df.loc[[0,2]]

|   | name | last | email |
|---|------|------|-------|
| 0 | Syed | Atif | abc@abc.com |
| 2 | Saeed | Shaheen | Saeed@hotmail.com |


For columns, we cannot use integer positions as we do with iloc. Instead, we use the column labels (strings) directly.

In [67]:
df.loc[[0,1], ['email', 'name']] # column labels work with loc, in the order given

|   | email | name |
|---|-------|------|
| 0 | abc@abc.com | Syed |
| 1 | JaneDoe@Yahoo.com | Saquib |


> Note: The columns in the result appear in the order given in the column list, not in the DataFrame's original order.
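
The examples above use the default index, where row labels and row positions happen to coincide. A minimal sketch of how loc and iloc differ otherwise, assuming a hypothetical index of 10, 20, 30 (not part of the original data):

```python
import pandas as pd

people = {
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz', 'Shaheen'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com'],
}

# Hypothetical index labels chosen so that labels and positions no longer match.
labelled = pd.DataFrame(people, index=[10, 20, 30])

by_label = labelled.loc[10]      # row whose label is 10
by_position = labelled.iloc[0]   # row at position 0 (the same row here)

try:
    labelled.loc[0]              # there is no row labelled 0
    missing_label = False
except KeyError:
    missing_label = True

print(by_label.equals(by_position), missing_label)
```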

We will now apply the above to our Stack Overflow survey dataset:

In [68]:
df = pd.read_csv('./data/survey_results_public.csv')
schema_df = pd.read_csv('./data/survey_results_schema.csv')

In [69]:
df.head()

|   | Respondent | MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | ... | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a student who is learning to code | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work | United Kingdom | No | Primary/elementary school | NaN | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 14.0 | Man | No | Straight / Heterosexual | NaN | No | Appropriate in length | Neither easy nor difficult |
| 1 | 2 | I am a student who is learning to code | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work | Bosnia and Herzegovina | Yes, full-time | Secondary school (e.g. American high school, G... | NaN | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 19.0 | Man | No | Straight / Heterosexual | NaN | No | Appropriate in length | Neither easy nor difficult |
| 2 | 3 | I am not primarily a developer, but I write co... | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | Thailand | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Web development or web design | ... | Just as welcome now as I felt last year | Tech meetups or events in your area;Courses on... | 28.0 | Man | No | Straight / Heterosexual | NaN | Yes | Appropriate in length | Neither easy nor difficult |
| 3 | 4 | I am a developer by profession | No | Never | The quality of OSS and closed source software ... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 22.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 4 | 5 | I am a developer by profession | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than pro... | Employed full-time | Ukraine | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech meetups or events in your area;Courses on... | 30.0 | Man | No | Straight / Heterosexual | White or of European descent;Multiracial | No | Appropriate in length | Easy |


In [71]:
# Let's see how many rows and columns we have in this df
df.shape

(88883, 85)

In [73]:
# To see what columns are available
df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife'

In [74]:
# to grab the hobbyist data
df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
4        Yes
        ... 
88878    Yes
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

In [76]:
# to count how many said yes and how many said no we use value_counts() method. 
# more about that later
df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64

In [77]:
# Grab first row
df.loc[0]

Respondent                                                      1
MainBranch                 I am a student who is learning to code
Hobbyist                                                      Yes
OpenSourcer                                                 Never
OpenSource      The quality of OSS and closed source software ...
                                      ...                        
Sexuality                                 Straight / Heterosexual
Ethnicity                                                     NaN
Dependents                                                     No
SurveyLength                                Appropriate in length
SurveyEase                             Neither easy nor difficult
Name: 0, Length: 85, dtype: object

In [78]:
# Grab the Hobbyist value from the first row
df.loc[0, 'Hobbyist']

'Yes'

In [79]:
# Grab the Hobbyist values from the first 3 rows
df.loc[[0,1,2], 'Hobbyist']

0    Yes
1     No
2    Yes
Name: Hobbyist, dtype: object

## Slicing
We can use the usual slicing format for both index labels and columns.
> Note:
>
> 1. Unlike normal Python slicing, the end value of a loc slice is inclusive. Slices also work on column labels, and leaving out the last named column would be counterintuitive.
>
> 2. We don't need [] brackets to define a slice; it is written directly inside loc.

In [81]:
df.loc[0:20, 'Hobbyist':'Employment']

|   | Hobbyist | OpenSourcer | OpenSource | Employment |
|---|----------|-------------|------------|------------|
| 0 | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work |
| 1 | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work |
| 2 | Yes | Never | The quality of OSS and closed source software ... | Employed full-time |
| 3 | No | Never | The quality of OSS and closed source software ... | Employed full-time |
| 4 | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than pro... | Employed full-time |
| 5 | Yes | Never | The quality of OSS and closed source software ... | Employed full-time |
| 6 | No | Never | The quality of OSS and closed source software ... | Independent contractor, freelancer, or self-em... |
| 7 | Yes | Less than once per year | OSS is, on average, of HIGHER quality than pro... | Not employed, but looking for work |
| 8 | Yes | Once a month or more often | The quality of OSS and closed source software ... | Employed full-time |
| 9 | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than pro... | Employed full-time |
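
The inclusive end of a loc slice can be contrasted with iloc, whose slices follow the usual Python rule and exclude the end. This sketch uses the small `people` DataFrame from earlier instead of the survey data, so it runs without the CSV files:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Syed', 'Saquib', 'Saeed'],
    'last': ['Atif', 'Faraz', 'Shaheen'],
    'email': ['abc@abc.com', 'JaneDoe@Yahoo.com', 'Saeed@hotmail.com'],
})

# loc: label-based, both ends included -> rows 0 and 1, columns 'name' and 'last'
by_label = df.loc[0:1, 'name':'last']

# iloc: position-based, end excluded -> row 0 only, column 0 only
by_position = df.iloc[0:1, 0:1]

print(by_label.shape, by_position.shape)
```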
