`pandas` is conventionally imported under the alias `pd`.  
`pandas` -> The name derives from "panel data", an econometrics term for 
multidimensional structured datasets (in short, tabular data)

In [45]:
import pandas as pd

`dict`: a dictionary of lists is a useful mental model for a pandas DataFrame

In [46]:
people = { # <- The whole dictionary maps onto one DataFrame
    "firstName": ["Corey", "Pritam", "Jane", "John"], # <- each key becomes a column
    "lastName": ["Chaufer", "Kundu", "Doe", "Doe"], # <- each list holds a column's values; position i across the lists forms row i
    "email": ["CoreyMSchaufer@gmail.com", "pritamkundu771@gmail.com", "JaneDoe@email.com", "JohnDoe@email.com"]
}

Accessing a column in the above dummy frame

In [47]:
people['email']

['CoreyMSchaufer@gmail.com',
 'pritamkundu771@gmail.com',
 'JaneDoe@email.com',
 'JohnDoe@email.com']

Q. How do we convert a `dict` object to a `DataFrame` in pandas?

`df` -> DataFrames are the backbone of pandas. A DataFrame is just rows and columns of data.  
`shape` -> gives the number of rows and columns as a tuple `(rows, columns)`

In [48]:
df = pd.DataFrame(people)
df

Unnamed: 0,firstName,lastName,email
0,Corey,Chaufer,CoreyMSchaufer@gmail.com
1,Pritam,Kundu,pritamkundu771@gmail.com
2,Jane,Doe,JaneDoe@email.com
3,John,Doe,JohnDoe@email.com
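
The dictionary-of-lists layout used above is column-oriented. As a minimal sketch, the same table can also be built from a row-oriented list of dictionaries; the names `by_column`, `by_row`, `df_cols` and `df_rows` are hypothetical and used only for illustration.

```python
import pandas as pd

# Column-oriented: each key is a column, each list holds that column's values
by_column = {
    "firstName": ["Jane", "John"],
    "lastName": ["Doe", "Doe"],
}

# Row-oriented: each dictionary describes one row (alternative layout)
by_row = [
    {"firstName": "Jane", "lastName": "Doe"},
    {"firstName": "John", "lastName": "Doe"},
]

df_cols = pd.DataFrame(by_column)
df_rows = pd.DataFrame(by_row)

# Both layouts describe the same 2 x 2 table
print(df_cols.equals(df_rows))
print(df_cols.shape)
```

Which layout is more convenient usually depends on how the data arrives: records read one at a time fit the row-oriented form, while data already grouped by field fits the column-oriented form.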


- 🎗️ NOTE: **`shape` is an attribute and not a method, so there is no need for ()**

In [49]:
df.shape

(4, 3)

To see information about all of the columns in the DataFrame  
`info()` -> Gives the number of rows, the number of columns, and the datatype of each column.

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   firstName  4 non-null      object
 1   lastName   4 non-null      object
 2   email      4 non-null      object
dtypes: object(3)
memory usage: 224.0+ bytes


Accessing a column in a DataFrame object, compared with pure Python: the syntax is _SIMILAR_

In [51]:
# A DataFrame supports dict-style column access with brackets, although it is not a subclass of dict
df['email']

0    CoreyMSchaufer@gmail.com
1    pritamkundu771@gmail.com
2           JaneDoe@email.com
3           JohnDoe@email.com
Name: email, dtype: object

Some people also use dot (`.`) notation to access a column in a DataFrame. It returns the same result as the bracket notation above.

- Reasons to prefer the bracket notation over the dot notation:
- Dot notation can clash with the methods or attributes of the DataFrame
- So to be on the safe side, we should prefer the bracket notation
- For example: if a DataFrame has a column named `count`, `df.count` refers to the `count()` method of the DataFrame rather than to the column

In [52]:
df.email 

0    CoreyMSchaufer@gmail.com
1    pritamkundu771@gmail.com
2           JaneDoe@email.com
3           JohnDoe@email.com
Name: email, dtype: object
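
The clash described above can be reproduced on a tiny frame. This is a minimal sketch; the `sales` frame and its `count` column are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method
sales = pd.DataFrame({"item": ["pen", "book"], "count": [3, 5]})

# Dot notation resolves to the DataFrame.count method, not the column
print(callable(sales.count))

# Bracket notation always returns the column
print(sales["count"].tolist())
```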

Types
- `DataFrame` -> A two-dimensional object made of rows and columns  
- `Series` -> A one-dimensional object, such as a single column

In [53]:
type(df['email'])

pandas.core.series.Series

Accessing multiple columns at the same time
- The result is no longer a `Series`
- It is a `DataFrame` object instead

In [54]:
cols = df[['lastName', 'email']]  # ❌ df['lastName', 'email'] <- Would throw a key error
print(type(cols))
cols

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,lastName,email
0,Chaufer,CoreyMSchaufer@gmail.com
1,Kundu,pritamkundu771@gmail.com
2,Doe,JaneDoe@email.com
3,Doe,JohnDoe@email.com
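
The return type depends on what is passed inside the brackets, not on how many columns come back. A short sketch, assuming a small stand-in frame named `df_demo` (hypothetical, with placeholder email addresses):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"lastName": ["Doe", "Kundu"], "email": ["a@example.com", "b@example.com"]}
)

one_col_series = df_demo["email"]    # a single label returns a Series
one_col_frame = df_demo[["email"]]   # a list of labels returns a DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(one_col_frame.shape)
```

Passing a one-element list is a common way to keep a single column as a DataFrame when later code expects two-dimensional input.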


How do we get the column names of the DataFrame?

In [55]:
df.columns, type(df.columns) # 🎗️ TIP: Separating expressions with a comma builds a tuple, so the notebook displays both values without a print statement

(Index(['firstName', 'lastName', 'email'], dtype='object'),
 pandas.core.indexes.base.Index)

How do we get the rows of a DataFrame?
- `loc[]` 
- `iloc[]`

In [56]:
# since this is an indexer, use brackets to access an item
df.iloc[0]  # <- Retrieving the first row of the dF as a Series object

firstName                       Corey
lastName                      Chaufer
email        CoreyMSchaufer@gmail.com
Name: 0, dtype: object

Accessing multiple rows at the same time

In [57]:
df.iloc[[0, 1], :] # selects the first and second row from the dataFrame

Unnamed: 0,firstName,lastName,email
0,Corey,Chaufer,CoreyMSchaufer@gmail.com
1,Pritam,Kundu,pritamkundu771@gmail.com


Specifying multiple rows and columns at the same time

Schema of dataFrame:
- `Index(['firstName', 'lastName', 'email'], dtype='object')`

In [58]:
# As from the above schema, since the email is at index 2, 
# to grab only the email column of the 2 rows we pass idx 2 as a second argument
# to the iloc object index
df.iloc[[0, 1], 2]

0    CoreyMSchaufer@gmail.com
1    pritamkundu771@gmail.com
Name: email, dtype: object

-   Differences between iloc and loc

    |           Attributes |               `iloc`                |                        `loc`                         |
    | -------------------: | :---------------------------------: | :--------------------------------------------------: |
    |           Meaning 👉 | iloc stands for integer location    |          loc stands for label location               |
    |         Parameter 👉 | takes integer positions             | takes labels (integers only when they are the labels) |
    |         Accessing 👉 |          `iloc[[1, 2, 3]]`          |                  `loc[['A', 'B']]`                   |
    | Callable Function 👉 |     `iloc[lambda x: [0, 2]]`        |             `loc[lambda x: x.index[:2]]`             |
    |           Slicing 👉 | `iloc[m:n]`, m included, n excluded |           `loc[A:B]`, both labels included           |
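
The slicing difference is easiest to see on a frame whose index labels are not integers. The sketch below uses a made-up `scores` frame with the labels `A` to `D`; the names are illustrative only.

```python
import pandas as pd

scores = pd.DataFrame(
    {"score": [10, 20, 30, 40]},
    index=["A", "B", "C", "D"],
)

by_position = scores.iloc[0:2]   # positions 0 and 1; the stop position is excluded
by_label = scores.loc["A":"C"]   # labels A through C; the stop label is included

print(by_position.index.tolist())  # ['A', 'B']
print(by_label.index.tolist())     # ['A', 'B', 'C']

# Callable form: the function receives the frame and returns a valid indexer
high = scores.loc[lambda d: d["score"] > 20]
print(high.index.tolist())         # ['C', 'D']
```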


In [59]:
df.loc[3, 'email'] # To select only email of 1 candidate
df.loc[[1, 2], 'email'] # To select email of n candidates
df.loc[[1, 2], ['email', 'lastName']] # To select n attributes of n candidates
# 🎗️ NOTE: Notice how the output columns follow the order of the list passed to the loc indexer

Unnamed: 0,email,lastName
1,pritamkundu771@gmail.com,Kundu
2,JaneDoe@email.com,Doe


In [60]:
df

Unnamed: 0,firstName,lastName,email
0,Corey,Chaufer,CoreyMSchaufer@gmail.com
1,Pritam,Kundu,pritamkundu771@gmail.com
2,Jane,Doe,JaneDoe@email.com
3,John,Doe,JohnDoe@email.com


Dataset source: [Stack Overflow Annual Dev Survey 2021](https://insights.stackoverflow.com/survey)  
![source.png](attachment:source.png)

In [61]:
# ❗❗ Don't forget to extract the 88 MB `survey_results_public.csv` after cloning the repo from git, to avoid a file-not-found error
df = pd.read_csv(r"data/survey_results_public.csv")
# df # <- Would show the top 5 and bottom 5 rows

Setting up global display options for pandas

In [62]:
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 5) 
df

Unnamed: 0,ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,YearsCodePro,DevType,OrgSize,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSys,NEWStuck,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,NEWOtherComms,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Slovakia,,,"Secondary school (e.g. American high school, G...",18 - 24 years,Coding Bootcamp;Other online resources (ex: vi...,,,"Developer, mobile",20 to 99 employees,EUR European Euro,4800.0,Monthly,C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift,Swift,PostgreSQL;SQLite,SQLite,,,Laravel;Symfony,,,,,,PHPStorm;Xcode,Atom;Xcode,MacOS,Call a coworker or friend;Visit Stack Overflow...,Stack Overflow,Multiple times per day,Yes,A few times per month or weekly,"Yes, definitely",No,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268.0
1,2,I am a student who is learning to code,"Student, full-time",Netherlands,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",7,,,,,,,JavaScript;Python,,PostgreSQL,,,,Angular;Flask;Vue.js,,Cordova,,Docker;Git;Yarn,Git,Android Studio;IntelliJ;Notepad++;PyCharm,,Windows,Visit Stack Overflow;Google it,Stack Overflow,Daily or almost daily,Yes,Daily or almost daily,"Yes, definitely",No,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83437,83438,I am a developer by profession,Employed full-time,Canada,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,Online Courses or Certification;Books / Physic...,5,3,"Developer, back-end",20 to 99 employees,CAD\tCanadian dollar,90000.0,Monthly,Bash/Shell;JavaScript;Node.js;Python,Go;Rust,Cassandra;Elasticsearch;MongoDB;PostgreSQL;Redis,,Heroku,AWS;DigitalOcean,Django;Express;Flask;React.js,,NumPy;Pandas;TensorFlow;Torch/PyTorch,NumPy;Pandas;TensorFlow;Torch/PyTorch,Ansible;Docker;Git;Terraform,Kubernetes;Terraform,PyCharm;Sublime Text,,MacOS,Call a coworker or friend;Visit Stack Overflow...,Stack Overflow,A few times per month or weekly,Yes,Less than once per month or monthly,"No, not really",No,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,I have a mood or emotional disorder (e.g. depr...,Appropriate in length,Neither easy nor difficult,816816.0
83438,83439,I am a developer by profession,Employed full-time,Brazil,,,"Professional degree (JD, MD, etc.)",11 - 17 years,School,14,4,"Developer, front-end;Developer, full-stack;Dev...",I don’t know,BRL\tBrazilian real,7700.0,Monthly,Delphi;Elixir;HTML/CSS;Java;JavaScript,Elixir;HTML/CSS;Java;JavaScript;Node.js;PHP;SQ...,Oracle;PostgreSQL,Elasticsearch;MongoDB;MySQL;Oracle;PostgreSQL;...,Microsoft Azure,AWS,Angular;Spring,Express;Laravel;Spring;Symfony,,,Docker;Git,Docker;Git;Kubernetes,IntelliJ;Visual Studio Code,IntelliJ;PHPStorm;Visual Studio Code,Linux-based,Call a coworker or friend;Visit Stack Overflow...,Stack Overflow;Stack Exchange;Stack Overflow f...,A few times per week,Yes,A few times per week,"Yes, somewhat",No,18-24 years old,Man,No,Straight / Heterosexual,Hispanic or Latino/a/x,None of the above,None of the above,Appropriate in length,Easy,21168.0
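
`pd.set_option` changes the setting for the rest of the session. When a wider display is only needed for one cell, `pd.option_context` can scope the change instead. A minimal sketch, using a made-up 30-column frame named `wide`:

```python
import pandas as pd

# Hypothetical wide frame: 30 columns, 3 rows
wide = pd.DataFrame({f"col{i}": range(3) for i in range(30)})

default_max = pd.get_option("display.max_columns")

# The option is changed only inside the with-block and restored on exit
with pd.option_context("display.max_columns", 85):
    inside = pd.get_option("display.max_columns")
    print(wide.head(2))

after = pd.get_option("display.max_columns")
print(inside, after == default_max)
```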


Analyzing the schema of the dataset

In [63]:
pd.set_option('display.max_rows', 48)  # <- Sets maximum 48 rows to display
schema_df = pd.read_csv("data/survey_results_schema.csv")
schema_df.head(10) # <- to display the first 10 rows of the data

Unnamed: 0,qid,qname,question,force_resp,type,selector
0,QID16,S0,"<div><span style=""font-size:19px;""><strong>Hel...",False,DB,TB
1,QID12,MetaInfo,Browser Meta Info,False,Meta,Browser
2,QID1,S1,"<span style=""font-size:22px; font-family: aria...",False,DB,TB
3,QID2,MainBranch,Which of the following options best describes ...,True,MC,SAVR
4,QID24,Employment,Which of the following best describes your cur...,False,MC,MAVR
5,QID6,Country,"Where do you live? <span style=""font-weight: b...",True,MC,DL
6,QID7,US_State,<p>In which state or territory of the USA do y...,False,MC,DL
7,QID9,UK_Country,In which part of the United Kingdom do you liv...,False,MC,DL
8,QID190,S2,"<span style=""font-size:22px; font-family: aria...",False,DB,TB
9,QID25,EdLevel,Which of the following best describes the high...,False,MC,SAVR


In [64]:
schema_df.tail(10) # <- to display the last 10 rows of the data

Unnamed: 0,qid,qname,question,force_resp,type,selector
38,QID127,Age,What is your age?,False,MC,MAVR
39,QID122,Gender,"Which of the following describe you, if any? P...",False,MC,MAVR
40,QID153,Trans,Do you identify as transgender?,False,MC,MAVR
41,QID136,Sexuality,"Which of the following describe you, if any? P...",False,MC,MAVR
42,QID126,Ethnicity,"Which of the following describe you, if any? P...",False,MC,MAVR
43,QID124,Accessibility,"Which of the following describe you, if any? P...",False,MC,MAVR
44,QID125,MentalHealth,"Which of the following describe you, if any? P...",False,MC,MAVR
45,QID131,S6,"<span style=""font-size:22px;""><strong>Final Qu...",False,DB,TB
46,QID132,SurveyLength,How do you feel about the length of the survey...,False,MC,MAVR
47,QID133,SurveyEase,How easy or difficult was this survey to compl...,False,MC,MAVR


In [65]:
rows, columns = df.shape
df.columns

Index(['ResponseId', 'MainBranch', 'Employment', 'Country', 'US_State',
       'UK_Country', 'EdLevel', 'Age1stCode', 'LearnCode', 'YearsCode',
       'YearsCodePro', 'DevType', 'OrgSize', 'Currency', 'CompTotal',
       'CompFreq', 'LanguageHaveWorkedWith', 'LanguageWantToWorkWith',
       'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith',
       'PlatformHaveWorkedWith', 'PlatformWantToWorkWith',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith',
       'MiscTechHaveWorkedWith', 'MiscTechWantToWorkWith',
       'ToolsTechHaveWorkedWith', 'ToolsTechWantToWorkWith',
       'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith', 'OpSys',
       'NEWStuck', 'NEWSOSites', 'SOVisitFreq', 'SOAccount', 'SOPartFreq',
       'SOComm', 'NEWOtherComms', 'Age', 'Gender', 'Trans', 'Sexuality',
       'Ethnicity', 'Accessibility', 'MentalHealth', 'SurveyLength',
       'SurveyEase', 'ConvertedCompYearly'],
      dtype='object')

To get a glimpse of how powerful the pandas library is: 
  - just have a look at the simple Series method below  
  - `value_counts()` -> returns each unique value together with its number of occurrences, most frequent first

In [66]:
df['Country'].value_counts()
# ⭐⭐⭐ As expected, and proud to see it: India ranks 2nd worldwide among Stack Overflow survey respondents

United States of America                                15288
India                                                   10511
Germany                                                  5625
United Kingdom of Great Britain and Northern Ireland     4475
Canada                                                   3012
                                                        ...  
Saint Kitts and Nevis                                       1
Dominica                                                    1
Saint Vincent and the Grenadines                            1
Tuvalu                                                      1
Papua New Guinea                                            1
Name: Country, Length: 181, dtype: int64
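
`value_counts()` can also report shares instead of raw counts through its `normalize` parameter. The sketch below runs on a small made-up Series rather than the survey data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample, not taken from the survey
countries = pd.Series(
    ["India", "Germany", "India", "Canada", "India", "Germany"],
    name="Country",
)

counts = countries.value_counts()                 # absolute counts, sorted descending
shares = countries.value_counts(normalize=True)   # fractions that sum to 1

print(counts.to_dict())
print(shares.round(2).to_dict())
```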

To get the first person's entire survey result:

In [67]:
df.loc[0]

ResponseId                                                                      1
MainBranch                                         I am a developer by profession
Employment                      Independent contractor, freelancer, or self-em...
Country                                                                  Slovakia
US_State                                                                      NaN
UK_Country                                                                    NaN
EdLevel                         Secondary school (e.g. American high school, G...
Age1stCode                                                          18 - 24 years
LearnCode                       Coding Bootcamp;Other online resources (ex: vi...
YearsCode                                                                     NaN
YearsCodePro                                                                  NaN
DevType                                                         Developer, mobile
OrgSize         

Find Operating systems of first 3 survey results:

In [68]:
df.loc[[0, 1, 2], 'OpSys']

0      MacOS
1    Windows
2      MacOS
Name: OpSys, dtype: object

The same result can be obtained by slicing. This is similar to list slicing, except that `loc` includes the end label

In [69]:
df.loc[:2, 'OpSys'] # ❌ df.loc[[:2], 'OpSys'] <- remember to use no brackets

0      MacOS
1    Windows
2      MacOS
Name: OpSys, dtype: object
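
The inclusive end point is the main difference from plain Python slicing. A small sketch comparing a list, `iloc` and `loc` on the same data; the `letters` list and the Series `s` are made up for illustration.

```python
import pandas as pd

letters = ["a", "b", "c", "d"]
s = pd.Series(letters)  # default integer index 0..3

print(letters[:2])          # plain list: stop excluded -> ['a', 'b']
print(s.iloc[:2].tolist())  # positional: stop excluded -> ['a', 'b']
print(s.loc[:2].tolist())   # label-based: label 2 included -> ['a', 'b', 'c']
```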

An example of column slicing with labels: Extracting from 'OpSys' to 'Accessibility'

In [70]:
df.loc[:2, 'OpSys': 'Accessibility']

Unnamed: 0,OpSys,NEWStuck,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,NEWOtherComms,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility
0,MacOS,Call a coworker or friend;Visit Stack Overflow...,Stack Overflow,Multiple times per day,Yes,A few times per month or weekly,"Yes, definitely",No,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above
1,Windows,Visit Stack Overflow;Google it,Stack Overflow,Daily or almost daily,Yes,Daily or almost daily,"Yes, definitely",No,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above
2,MacOS,Visit Stack Overflow;Google it;Watch help / tu...,Stack Overflow;Stack Exchange,Multiple times per day,Yes,Multiple times per day,"Yes, definitely",Yes,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above


An example of column slicing with a step: every 3rd column up to 'Accessibility'

In [71]:
df.loc[:3, :'Accessibility':3] 

Unnamed: 0,ResponseId,Country,EdLevel,YearsCode,OrgSize,CompFreq,DatabaseHaveWorkedWith,PlatformWantToWorkWith,MiscTechHaveWorkedWith,ToolsTechWantToWorkWith,OpSys,SOVisitFreq,SOComm,Gender,Ethnicity
0,1,Slovakia,"Secondary school (e.g. American high school, G...",,20 to 99 employees,Monthly,PostgreSQL;SQLite,,,,MacOS,Multiple times per day,"Yes, definitely",Man,White or of European descent
1,2,Netherlands,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",7.0,,,PostgreSQL,,Cordova,Git,Windows,Daily or almost daily,"Yes, definitely",Man,White or of European descent
2,3,Russian Federation,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",,,,SQLite,,NumPy;Pandas;TensorFlow;Torch/PyTorch,,MacOS,Multiple times per day,"Yes, definitely",Man,Prefer not to say
3,4,Austria,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",,100 to 499 employees,Monthly,,,,,Windows,Daily or almost daily,Neutral,Man,White or of European descent
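
The same step slicing is easier to follow on a frame with a handful of columns. A minimal sketch, assuming a one-row frame named `small` with columns `a` to `f` (hypothetical):

```python
import pandas as pd

small = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=["a", "b", "c", "d", "e", "f"])

# All rows; columns from the start up to and including 'e', taking every 3rd one
every_third = small.loc[:, :"e":3]

print(every_third.columns.tolist())  # ['a', 'd']
```

The stop label only bounds the range: a column is returned when its position is a multiple of the step, so `e` itself is skipped here.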
