**02: Dataframe Basics - Creating Dataframes, Pandas Series Basics**
- Note that a dataframe is essentially a table. A term like '2-dimensional dataframe' simply means the data has rows and columns, so there is no need to overcomplicate it.

In [1]:
import pandas as pd

***
Creating a DataFrame:
- Dataframes are often created from dictionaries: keys == column headings, values == column data. The number of rows (and so the length of the index) equals the number of values in each list.
- Every list in the dictionary must have the same length, so there can be no gaps; fill a missing entry with a placeholder such as None or NaN instead.
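
A minimal sketch of the equal-length rule above, using a made-up dictionary (`students_with_gap` is a hypothetical name, not part of the notebook). It assumes a reasonably recent pandas version, where unequal list lengths raise a `ValueError`.

```python
import pandas as pd

# Hypothetical data: the last student's age is unknown, so None fills the gap.
students_with_gap = {
    'names': ['Tom', 'Bob', 'Jane'],
    'age': [9, 10, None],
}
gap_df = pd.DataFrame(students_with_gap)

# pandas stores the missing entry as NaN, which isna() can detect.
missing_count = int(gap_df['age'].isna().sum())

# Lists of different lengths are rejected when the dataframe is built.
try:
    pd.DataFrame({'names': ['Tom', 'Bob', 'Jane'], 'age': [9, 10]})
    error_raised = False
except ValueError:
    error_raised = True

print(missing_count, error_raised)
```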

In [4]:
students = {
    'names': ['Tom', 'Bob', 'Jane', 'May'],
    'age': [9, 10, 10, 9],
    'subjects': ['Science', 'Arts', 'Hybrid', 'Arts'],
    'award winner': [True, False, False, True]
}

In [6]:
df = pd.DataFrame(students)
df
#dictionary containing 'data'. this data is converted into a table (dataframe) using pd.DataFrame

   names  age  subjects  award winner
0    Tom    9   Science          True
1    Bob   10      Arts         False
2   Jane   10    Hybrid         False
3    May    9      Arts          True


***
Manipulating Columns:
- Accessing a column is rather similar to accessing a key in a dictionary
- Note the data returned is NOT a plain Python list, but a pandas Series object, a labelled 'one-dimensional array'
- The two-dimensional dataframe object can be thought of as a container for multiple one-dimensional Series objects
- Accessing multiple columns returns a dataframe
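
To check the Series vs dataframe distinction above, a small sketch that rebuilds a cut-down version of the `students` table and inspects the types. The variable names `single` and `nested` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'names': ['Tom', 'Bob', 'Jane', 'May'],
    'age': [9, 10, 10, 9],
})

single = df['names']    # one key -> Series (one-dimensional)
nested = df[['names']]  # list holding one key -> DataFrame with a single column

print(type(single).__name__)  # Series
print(type(nested).__name__)  # DataFrame
print(single.shape, nested.shape)  # (4,) (4, 1)
```

Even with only one column requested, the nested brackets still give back a dataframe, which is handy when later code expects a table rather than a Series.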

In [9]:
df['names']
#Accessing a column in the dataframe, using the column header (i.e key)

0     Tom
1     Bob
2    Jane
3     May
Name: names, dtype: object

In [11]:
df[['names', 'subjects']]
#Accessing multiple columns. Note the list of keys has to be nested inside another pair of brackets.

   names  subjects
0    Tom   Science
1    Bob      Arts
2   Jane    Hybrid
3    May      Arts


In [13]:
df.columns
#returns the column labels as a pandas Index object (not a plain Python list)

Index(['names', 'age', 'subjects', 'award winner'], dtype='object')
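
Since `df.columns` gives an Index rather than a list, a short sketch of how it can still be used like one. The dataframe here is a trimmed copy of the one above, rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'names': ['Tom', 'Bob'],
    'age': [9, 10],
    'subjects': ['Science', 'Arts'],
})

cols = df.columns
column_list = list(cols)  # convert to a plain Python list when one is needed
has_age = 'age' in cols   # membership test works directly on the Index
first_col = cols[0]       # positional access works too

print(column_list, has_age, first_col)
```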

***
Manipulating Rows:
- Accessing a single row similarly returns a Series, holding that row's values labelled by column name
- Accessing multiple rows returns a dataframe instead
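
A quick sketch of the two cases above, on a rebuilt copy of the student table. The names `first_row` and `two_rows` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'names': ['Tom', 'Bob', 'Jane', 'May'],
    'age': [9, 10, 10, 9],
})

first_row = df.iloc[0]      # single position -> Series indexed by column name
two_rows = df.iloc[[0, 3]]  # list of positions -> DataFrame

print(type(first_row).__name__, first_row['names'])   # Series Tom
print(type(two_rows).__name__, list(two_rows.index))  # DataFrame [0, 3]
```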

In [16]:
df.iloc[0]
#iloc - integer location (of a row).

names               Tom
age                   9
subjects        Science
award winner       True
Name: 0, dtype: object

In [18]:
df.iloc[[0,3]]
#Accessing multiple rows. Similarly, the list of positions has to be nested.
#The nesting is needed because unnested values are read as coordinates [row, column], rather than as a list of rows.

   names  age  subjects  award winner
0    Tom    9   Science          True
3    May    9      Arts          True


***
Selecting specific values/series/arrays:
- Similar to finding the 'coordinates' of the values: [row, column]
- For iloc, both are integer positions, so the 1st position = 0
- For loc, both are the actual labels of the index/columns. With the default (unlabelled) index the row labels are simply 0, 1, 2, ..., so they match the iloc positions; with a custom index, loc uses the labels instead of the positions.
- Accessing multiple rows and columns returns either series or dataframes, depending on whether the data returned is one-dimensional or two-dimensional.
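
The loc vs iloc difference only becomes visible once the index has labels. As a sketch, assuming we make 'names' the index with `set_index` (a step the notebook itself does not take):

```python
import pandas as pd

df = pd.DataFrame({
    'names': ['Tom', 'Bob', 'Jane', 'May'],
    'age': [9, 10, 10, 9],
    'subjects': ['Science', 'Arts', 'Hybrid', 'Arts'],
})

# Assumption for illustration: use the 'names' column as row labels.
labelled = df.set_index('names')

by_label = labelled.loc['Jane', 'age']  # row label, column label
by_position = labelled.iloc[2, 0]       # 3rd row, 1st remaining column ('age')

# Integer labels no longer exist, so loc with an integer fails.
try:
    labelled.loc[2]
    integer_label_found = True
except KeyError:
    integer_label_found = False

print(by_label, by_position, integer_label_found)
```

Note that 'age' moves to column position 0 here, because 'names' has left the columns to become the index.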

In [21]:
df.iloc[2, 1]
#3rd row, 2nd column i.e 'age'. returns a value

10

In [23]:
df.iloc[[1,2,3],[2,3]]
#it is also possible to access multiple rows, and multiple columns.
#rows -- [1,2,3], columns -- [2,3]

   subjects  award winner
1      Arts         False
2    Hybrid         False
3      Arts          True


In [25]:
df.loc[[1,2],['names','award winner']]
#two-dimensional dataset that returns a dataframe

   names  award winner
1    Bob         False
2   Jane         False


***
Real Data Section:
- Grabbing columns/rows
- Grabbing values
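- Important to note when slicing values using loc: do not nest the slice within brackets (it is a range, no longer a list of values being passed).
- The last value is INCLUSIVE when slicing by label with loc; slicing by position with iloc excludes the end, like normal Python slicing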
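
A small sketch comparing the slicing behaviours on a toy dataframe (not the survey data), to make the inclusive/exclusive difference concrete:

```python
import pandas as pd

df = pd.DataFrame({'age': [9, 10, 10, 9]})

label_slice = df.loc[0:2]       # labels 0, 1 and 2 -> end label included
position_slice = df.iloc[0:2]   # positions 0 and 1 -> end position excluded
two_labels = df.loc[[0, 2]]     # nested list -> only the rows labelled 0 and 2

print(len(label_slice), len(position_slice), list(two_labels.index))  # 3 2 [0, 2]
```

This is also why the slice is not nested: `0:2` describes a range, while `[0, 2]` is a list of two specific labels.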

In [28]:
response_df = pd.read_csv('data.csv')
schema_df = pd.read_csv('schema_data.csv')

In [29]:
pd.set_option('display.max_rows', 118)

In [30]:
response_df['CodingActivities']
#preview - head + tail. returns a series since this is one-dimensional data

0                                                    Hobby
1        Hobby;Contribute to open-source projects;Other...
2        Hobby;Contribute to open-source projects;Other...
3                                                      NaN
4                                                      NaN
                               ...                        
65432                        Hobby;School or academic work
65433             Hobby;Contribute to open-source projects
65434                                                Hobby
65435    Hobby;Contribute to open-source projects;Profe...
65436                                                  NaN
Name: CodingActivities, Length: 65437, dtype: object

In [34]:
response_df.loc[0]
#Grabbing response from the first respondent

ResponseId                                                                        1
MainBranch                                           I am a developer by profession
Age                                                              Under 18 years old
Employment                                                      Employed, full-time
RemoteWork                                                                   Remote
Check                                                                        Apples
CodingActivities                                                              Hobby
EdLevel                                                   Primary/elementary school
LearnCode                                                    Books / Physical media
LearnCodeOnline                                                                 NaN
TechDoc                                                                         NaN
YearsCode                                                                   

In [36]:
response_df.loc[0:20, 'EdLevel']
#Grabbing EdLevel from first 21 respondents (slicing x:y including y)
#recall that slicing does not involve nesting

0                             Primary/elementary school
1          Bachelor’s degree (B.A., B.S., B.Eng., etc.)
2       Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
3     Some college/university study without earning ...
4     Secondary school (e.g. American high school, G...
5                             Primary/elementary school
6        Professional degree (JD, MD, Ph.D, Ed.D, etc.)
7     Secondary school (e.g. American high school, G...
8        Professional degree (JD, MD, Ph.D, Ed.D, etc.)
9       Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
10         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
11       Professional degree (JD, MD, Ph.D, Ed.D, etc.)
12         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
13         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
14      Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
15    Some college/university study without earning ...
16                            Primary/elementary school
17         Bachelor’s degree (B.A., B.S., B.Eng.

In [38]:
response_df['CodingActivities'].value_counts()
#advanced method returning a series with the number of occurrences of each value, most common first

CodingActivities
Hobby                                                                                                                                                                                                            9993
I don’t code outside of work                                                                                                                                                                                     6508
Hobby;Professional development or self-paced learning from online courses                                                                                                                                        6203
Hobby;Contribute to open-source projects                                                                                                                                                                         3732
Professional development or self-paced learning from online courses                                                            
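
The `value_counts` call above skips NaN by default. A minimal sketch on a made-up series (the values are invented, not taken from the survey) showing the `dropna` and `normalize` options:

```python
import pandas as pd

# Hypothetical answers, with two missing responses.
activities = pd.Series(['Hobby', 'Hobby', None, 'School', 'Hobby', None])

counts = activities.value_counts()                       # missing values excluded
counts_with_nan = activities.value_counts(dropna=False)  # missing values counted too
shares = activities.value_counts(normalize=True)         # proportions instead of counts

print(counts['Hobby'], int(counts_with_nan.sum()), shares['Hobby'])  # 3 6 0.75
```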