[Home](https://pandas.pydata.org/pandas-docs/stable/index.html)
[What's New](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.1.html)
[Getting Started](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)
[User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
[API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
[Development](https://pandas.pydata.org/pandas-docs/stable/development/index.html)
[Release Notes](https://pandas.pydata.org/pandas-docs/stable/whatsnew/index.html)
***
<img src="https://pandas.pydata.org/pandas-docs/stable/_static/pandas.svg" align="left" alt="dataframe image" width = "400">
***

In [1]:
# Standard import: pd is the community-agreed alias for pandas, so the pandas documentation assumes pandas is loaded as pd.

import pandas as pd
from pandas_profiling import ProfileReport


ModuleNotFoundError: No module named 'pandas_profiling'

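If you installed a conda distribution, you can install or update pandas with conda (conda install pandas, conda update pandas), or you can use pip (pip install pandas). A ModuleNotFoundError like the one above means the package is not installed in the active environment; pandas_profiling is a separate package (pip install pandas-profiling) that has since been renamed ydata-profiling.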

When working with tabular data, such as data stored in spreadsheets or databases, pandas is a good fit, and it works well in Jupyter Notebook or JupyterLab. Pandas helps you explore, clean, visualize, analyze, and process your data. In pandas, a data table is called a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame).

<!--![dataframe](images/dataframe.png) -->
<img src="https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg" align="left" alt="dataframe image" width = "400">
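As a minimal sketch of what a DataFrame looks like in code, the example below builds one from a dictionary of columns. The values are three rows from the sample stock data used in this notebook; the variable name prices is only illustrative.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
# `prices` is an illustrative name, not one used elsewhere in this notebook.
prices = pd.DataFrame(
    {
        "Date": ["2012-06-01", "2012-05-01", "2012-04-02"],
        "Open": [569.16, 584.90, 601.83],
        "Close": [584.00, 577.73, 583.98],
        "Volume": [14077000, 18827900, 28759100],
    }
)

print(prices.shape)   # (number of rows, number of columns)
print(prices.dtypes)  # one dtype per column
```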

In [1]:
%%file data/data.csv
Date,Open,High,Low,Close,Volume,Adj Close
2012-06-01,569.16,590.00,548.50,584.00,14077000,581.50
2012-05-01,584.90,596.76,522.18,577.73,18827900,575.26
2012-04-02,601.83,644.00,555.00,583.98,28759100,581.48
2012-03-01,548.17,621.45,516.22,599.55,26486000,596.99
2012-02-01,458.41,547.61,453.98,542.44,22001000,540.12
2012-01-03,409.40,458.24,409.00,456.48,12949100,454.53

Overwriting data/data.csv


In [2]:
!ls data

AI_Tools.csv             data.csv                 list-skus.json
BankChurners.csv         gov_domains.csv          list-skus2.json
BankChurners_clean.csv   kernels.csv              melb_data.csv
categorical.pkl          list-skus.csv            rdu-weather-history.json


In [4]:
#set max columns
pd.set_option('display.max_columns', None)
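Display options set this way stay in effect for the whole session. The sketch below shows one way to inspect, temporarily override, and reset the same option; it assumes nothing beyond pandas itself.

```python
import pandas as pd

# Show every column when a DataFrame is displayed.
pd.set_option("display.max_columns", None)
print(pd.get_option("display.max_columns"))

# option_context changes an option only inside the with-block.
with pd.option_context("display.max_columns", 5):
    inside = pd.get_option("display.max_columns")

# Outside the block, the earlier setting is back in effect.
after = pd.get_option("display.max_columns")

# reset_option restores the library default.
pd.reset_option("display.max_columns")
default = pd.get_option("display.max_columns")
```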

Read this file into a DataFrame:

In [5]:
#Windows
#df = pd.read_csv(r'data\gov_domains.csv')

#Mac
df = pd.read_csv('data/gov_domains.csv')
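The following sketch shows two optional refinements to read_csv. It reads two rows of the sample stock data from an in-memory string so that it runs without any files, and it builds the path with pathlib so the same line works on Windows and Mac. The names csv_text, stock, csv_path, and domains are illustrative, and the file is only read if it exists.

```python
import io
from pathlib import Path

import pandas as pd

# Two rows from the sample stock data, held in memory instead of on disk.
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2012-06-01,569.16,590.00,548.50,584.00,14077000,581.50
2012-05-01,584.90,596.76,522.18,577.73,18827900,575.26
"""

# parse_dates converts the Date column from text to datetime values.
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(stock.dtypes)

# pathlib joins path parts with the correct separator for the current OS.
csv_path = Path("data") / "gov_domains.csv"
if csv_path.exists():
    domains = pd.read_csv(csv_path)
```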

In [6]:
profile = ProfileReport(df, title='Government Domains Profiling Report', explorative=True)

In [7]:
df

Unnamed: 0,Domain Name,Domain Type,Agency,Organization,City,State,Security Contact Email
0,ABERDEENMD.GOV,City,Non-Federal Agency,City of Aberdeen,Aberdeen,MD,(blank)
1,ABERDEENWA.GOV,City,Non-Federal Agency,City of Aberdeen,Aberdeen,WA,(blank)
2,ABILENETX.GOV,City,Non-Federal Agency,City of Abilene,Abilene,TX,(blank)
3,ABINGDON-VA.GOV,City,Non-Federal Agency,Town of Abingdon,Abingdon,VA,(blank)
4,ABINGTONMA.GOV,City,Non-Federal Agency,Town of Abington,Abington,MA,wnorling@abingtonma.gov
...,...,...,...,...,...,...,...
6230,WYWINDFALL.GOV,State/Local Govt,Non-Federal Agency,State Of Wyoming,Cheyenne,WY,ets-security@wyo.gov
6231,YARROWPOINTWA.GOV,State/Local Govt,Non-Federal Agency,Town of Yarrow Point,Yarrow Point,WA,(blank)
6232,YORKSC.GOV,State/Local Govt,Non-Federal Agency,City of York,York,SC,(blank)
6233,ZERODEATHSMD.GOV,State/Local Govt,Non-Federal Agency,Maryland Highway Safety Office,Glen Burnie,MD,dkopke@mdot.maryland.gov


In [8]:
profile




Each column in a DataFrame is a Series!

In [9]:
# Select a single column; the result is a Series
df['Domain Name']

0            ABERDEENMD.GOV
1            ABERDEENWA.GOV
2             ABILENETX.GOV
3           ABINGDON-VA.GOV
4            ABINGTONMA.GOV
               ...         
6230         WYWINDFALL.GOV
6231      YARROWPOINTWA.GOV
6232             YORKSC.GOV
6233       ZERODEATHSMD.GOV
6234    ZEROINWISCONSIN.GOV
Name: Domain Name, Length: 6235, dtype: object
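A small sketch of the difference between single and double brackets, using two rows from the table above; the frame name small is illustrative. Single brackets return a Series, while a list of column names returns a DataFrame, even when the list has one entry.

```python
import pandas as pd

# `small` is an illustrative two-row frame built from values shown above.
small = pd.DataFrame(
    {
        "Domain Name": ["ABERDEENMD.GOV", "ABILENETX.GOV"],
        "State": ["MD", "TX"],
    }
)

one_column = small["Domain Name"]          # single brackets: a Series
one_column_frame = small[["Domain Name"]]  # list of names: a DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
```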

In [10]:
#to call a row of data use .loc[index]
df.loc[3]

Domain Name                  ABINGDON-VA.GOV
Domain Type                             City
Agency                    Non-Federal Agency
Organization                Town of Abingdon
City                                Abingdon
State                                     VA
Security Contact Email               (blank)
Name: 3, dtype: object
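Note that .loc selects by index label, not by position. With the default 0, 1, 2, ... index the two coincide, which can hide the distinction. The sketch below uses a deliberately non-default index (the labels 10, 20, 30 are made up for illustration) to contrast .loc with the position-based .iloc.

```python
import pandas as pd

# The index labels 10, 20, 30 are hypothetical, chosen so labels and positions differ.
small = pd.DataFrame(
    {
        "Domain Name": ["ABERDEENMD.GOV", "ABERDEENWA.GOV", "ABILENETX.GOV"],
        "State": ["MD", "WA", "TX"],
    },
    index=[10, 20, 30],
)

row_by_label = small.loc[20]     # the row whose index label is 20
row_by_position = small.iloc[0]  # the first row, whatever its label

print(row_by_label["State"])
print(row_by_position.name)
```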

In [11]:
# Select a single column and compute summary statistics for it
df['Domain Name'].describe()

count                   6235
unique                  6235
top       CITYOFROGERSTX.GOV
freq                       1
Name: Domain Name, dtype: object

In [12]:
#df['State'].where(df['Domain Name'] == 'ai.mil')
# regex=False matches ".GOV" literally; by default "." is a regex wildcard
df1 = df['Domain Name'].str.contains(".GOV", regex=False)
print(df1)

0       True
1       True
2       True
3       True
4       True
        ... 
6230    True
6231    True
6232    True
6233    True
6234    True
Name: Domain Name, Length: 6235, dtype: bool
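One detail worth knowing about str.contains is that the pattern is treated as a regular expression by default, so a dot matches any character. The sketch below shows how that changes the result on three test strings; "XGOV" is a made-up value included only to expose the difference.

```python
import pandas as pd

# "XGOV" is a hypothetical value with no literal ".GOV" in it.
names = pd.Series(["ABERDEENMD.GOV", "XGOV", "AI.MIL"])

as_regex = names.str.contains(".GOV")                 # "." matches any character
as_literal = names.str.contains(".GOV", regex=False)  # "." must be a real dot
ends_with = names.str.endswith(".GOV")                # literal suffix check

print(as_regex.tolist())
print(as_literal.tolist())
print(ends_with.tolist())
```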


Pandas integrates with many file formats and data sources out of the box (CSV, Excel, SQL, JSON, Parquet, HTML, HDF5). Data is imported from each of these sources by a function with the prefix read_*. Similarly, the to_* methods store data. See the supported list: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

<img src="https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg" alt="dataframe image" width = "600">
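The read_*/to_* pairing can be sketched as a round trip. The example below writes a tiny frame to CSV and JSON text in memory and reads each back; the two rows come from the AI tools table used in this notebook, and the variable names are illustrative.

```python
import io

import pandas as pd

tools = pd.DataFrame(
    {
        "Tool/Capability": ["Adobe Photoshop", "Amazon Backup"],
        "Capability Type": ["Graphic editing", "AWS Tools"],
    }
)

# to_csv with no path returns the CSV text; read_csv parses it back.
csv_text = tools.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# to_json / read_json follow the same pattern.
json_text = tools.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```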


In [13]:
df2 = pd.read_csv('data/AI_Tools.csv')

In [14]:
df2.head()

Unnamed: 0,Tool/Capability,Description,Capability Type
0,Adobe Photoshop,Adobe Photoshop is a raster graphics editor.,Graphic editing
1,Amazon Backup,Fully managed backup service that makes it eas...,AWS Tools
2,Amazon Comprehend,Amazon Comprehend is a natural language proces...,AWS Tools
3,Amazon DynamoDB,Amazon DynamoDB is a fully managed proprietary...,AWS Tools
4,Amazon Elastic Block Storage,"Persistent, durable, low-latency block-level s...",AWS Tools


In [15]:
list(df2.columns)

['Tool/Capability', 'Description', 'Capability Type']

In [16]:
df4 = df2[['Tool/Capability', 'Capability Type']]

In [17]:
df4.groupby(['Capability Type']).count()

Unnamed: 0_level_0,Tool/Capability
Capability Type,Unnamed: 1_level_1
AI/ML Tools,58
AWS Tools,33
Automation/Infrastructure Tools,34
Azure Tools,21
Collaboration Tools,3
Data Ingestion/Manipulation,23
Data Visualization,9
Database,14
DevSecOps Tools,30
Docker Tools,5


In [2]:
# Data.gov is one of the best places to find federal data to integrate into your project.
import webbrowser
datagov = 'https://www.data.gov/'
webbrowser.open(datagov)

True

In [19]:
#https://catalog.data.gov/dataset/local-weather-archive Town of Cary, North Carolina
!ls data

AI_Tools.csv             data.csv                 list-skus.json
BankChurners.csv         gov_domains.csv          list-skus2.json
BankChurners_clean.csv   kernels.csv              melb_data.csv
categorical.pkl          list-skus.csv            rdu-weather-history.json


In [20]:
weatherdf = pd.read_json('data/rdu-weather-history.json')

In [21]:
list(weatherdf.columns)

['fogground',
 'snowfall',
 'dust',
 'snowdepth',
 'mist',
 'drizzle',
 'hail',
 'fastest2minwindspeed',
 'thunder',
 'glaze',
 'snow',
 'ice',
 'fog',
 'temperaturemin',
 'fastest5secwindspeed',
 'freezingfog',
 'temperaturemax',
 'blowingsnow',
 'freezingrain',
 'rain',
 'highwind',
 'date',
 'precipitation',
 'fogheavy',
 'smokehaze',
 'avgwindspeed',
 'fastest2minwinddir',
 'fastest5secwinddir']

In [22]:
weatherdf.head()

Unnamed: 0,fogground,snowfall,dust,snowdepth,mist,drizzle,hail,fastest2minwindspeed,thunder,glaze,snow,ice,fog,temperaturemin,fastest5secwindspeed,freezingfog,temperaturemax,blowingsnow,freezingrain,rain,highwind,date,precipitation,fogheavy,smokehaze,avgwindspeed,fastest2minwinddir,fastest5secwinddir
0,No,0.0,No,0.0,Yes,No,No,17.9,No,No,No,No,Yes,50.0,21.92,No,71.1,No,No,Yes,No,2007-01-06,0.13,No,No,8.05,230.0,230.0
1,No,0.0,No,0.0,No,No,No,23.04,No,No,No,No,No,30.0,29.08,No,55.0,No,No,Yes,No,2007-01-09,0.0,No,No,7.61,280.0,270.0
2,No,0.0,No,0.0,No,No,No,23.94,No,No,No,No,No,57.0,29.08,No,73.9,No,No,No,No,2007-01-15,0.0,No,No,13.2,230.0,230.0
3,No,0.0,No,0.0,Yes,Yes,No,8.05,No,No,No,No,Yes,33.1,12.08,No,41.0,No,No,Yes,No,2007-01-22,0.08,No,No,2.01,230.0,10.0
4,No,0.0,No,0.0,No,No,No,17.9,No,No,No,No,No,24.1,21.03,No,48.9,No,No,No,No,2007-01-30,0.0,No,No,5.82,220.0,220.0


In [23]:
weatherdf

Unnamed: 0,fogground,snowfall,dust,snowdepth,mist,drizzle,hail,fastest2minwindspeed,thunder,glaze,snow,ice,fog,temperaturemin,fastest5secwindspeed,freezingfog,temperaturemax,blowingsnow,freezingrain,rain,highwind,date,precipitation,fogheavy,smokehaze,avgwindspeed,fastest2minwinddir,fastest5secwinddir
0,No,0.0,No,0.0,Yes,No,No,17.90,No,No,No,No,Yes,50.0,21.92,No,71.1,No,No,Yes,No,2007-01-06,0.13,No,No,8.05,230.0,230.0
1,No,0.0,No,0.0,No,No,No,23.04,No,No,No,No,No,30.0,29.08,No,55.0,No,No,Yes,No,2007-01-09,0.00,No,No,7.61,280.0,270.0
2,No,0.0,No,0.0,No,No,No,23.94,No,No,No,No,No,57.0,29.08,No,73.9,No,No,No,No,2007-01-15,0.00,No,No,13.20,230.0,230.0
3,No,0.0,No,0.0,Yes,Yes,No,8.05,No,No,No,No,Yes,33.1,12.08,No,41.0,No,No,Yes,No,2007-01-22,0.08,No,No,2.01,230.0,10.0
4,No,0.0,No,0.0,No,No,No,17.90,No,No,No,No,No,24.1,21.03,No,48.9,No,No,No,No,2007-01-30,0.00,No,No,5.82,220.0,220.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4839,No,0.0,No,0.0,No,No,No,23.94,No,No,No,No,No,37.0,33.11,No,69.1,No,No,No,No,2020-03-09,0.00,No,No,11.41,230.0,230.0
4840,No,0.0,No,0.0,No,No,No,17.00,Yes,No,No,No,Yes,45.0,25.05,No,57.0,No,No,No,No,2020-03-25,0.86,No,No,7.16,280.0,280.0
4841,No,0.0,No,0.0,No,No,No,14.99,No,No,No,No,No,46.9,21.03,No,61.0,No,No,No,No,2020-03-26,0.00,No,No,5.14,230.0,230.0
4842,No,0.0,No,0.0,No,No,No,17.00,No,No,No,No,No,52.0,23.94,No,81.0,No,No,No,No,2020-03-27,0.00,No,No,8.50,240.0,230.0


In [24]:
# You can also use a boolean condition to ask questions of the data
snowed = weatherdf['snowfall'] > 0

In [25]:
snowed

0       False
1       False
2       False
3       False
4       False
        ...  
4839    False
4840    False
4841    False
4842    False
4843    False
Name: snowfall, Length: 4844, dtype: bool
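A boolean Series like this is usually put to work as a mask. The sketch below filters rows with it and counts the matches; the frame is a three-row stand-in for the weather data, with one snowfall value changed to 0.2 (a hypothetical edit) so that the filter has something to find.

```python
import pandas as pd

weather = pd.DataFrame(
    {
        "date": ["2007-01-06", "2007-01-09", "2007-01-15"],
        "snowfall": [0.0, 0.2, 0.0],  # 0.2 is hypothetical, to give the mask a match
        "temperaturemin": [50.0, 30.0, 57.0],
    }
)

snowed = weather["snowfall"] > 0  # True/False for every row

snow_days = weather[snowed]       # keep only the rows where the mask is True
n_snow_days = int(snowed.sum())   # True counts as 1, so the sum is the match count
share = snowed.mean()             # fraction of rows that match

print(snow_days)
print(n_snow_days, share)
```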

In [26]:
weatherdf.snowfall.value_counts()

0.00    4793
0.20       7
0.39       6
0.31       4
0.51       3
0.71       3
1.42       3
0.12       3
0.98       3
0.91       2
1.18       2
1.89       2
0.59       2
3.19       2
3.31       1
6.69       1
2.52       1
7.01       1
3.50       1
5.91       1
0.79       1
3.58       1
Name: snowfall, dtype: int64

In [27]:
weatherdf.precipitation.value_counts()

0.00    3283
0.01     131
0.02      79
0.03      65
0.04      53
        ... 
2.17       1
2.50       1
1.87       1
3.31       1
1.71       1
Name: precipitation, Length: 189, dtype: int64

In [14]:
import numpy as np
from pandas import Series, DataFrame

In [2]:
#Quicklearn DataFrames

#Let's get some data to play with. How about the Jupyter Kernels github
import webbrowser
website = 'https://github.com/jupyter/jupyter/wiki/Jupyter-kernels'
webbrowser.open(website)

True

In [3]:
#Copy and read to get data
kernels = pd.read_clipboard()

In [4]:
kernels

Unnamed: 0,Name,Jupyter/IPython Version,Language(s) Version,3rd party dependencies,Example Notebooks,Notes
0,D-lang,Jupyter,DMD,,,
1,Micronaut,,"Python>=3.7.5, Groovy>3",Micronaut,https://github.com/stainlessai/micronaut-jupyt...,Compatible with BeakerX
2,Agda kernel,,2.6.0,,https://mybinder.org/v2/gh/lclem/agda-kernel/m...,
3,Dyalog Jupyter Kernel,,APL (Dyalog),Dyalog >= 15.0,Notebooks,Can also be run on TryAPL's Learn tab
4,Coarray-Fortran,Jupyter 4.0,Fortran 2008/2015,"GFortran >= 7.1, OpenCoarrays, MPICH >= 3.2","Demo, Binder demo",Docker image
...,...,...,...,...,...,...
148,.Net Interactive,Jupyter 4,"C#, F#, Powershell",.Net Core SDK,Binder Examples,
149,mariadb_kernel,Jupyter Notebook/Lab,SQL,"Internal Dependencies, MariaDB Server",Binder notebook,A Jupyter kernel for the MariaDB Open Source d...
150,ISetlX,Jupyter,SetlX,,Example,
151,Ganymede,Jupyter >= 4.0,"Java 11+, Groovy, Javascript, Kotlin, Scala, A...","JShell, Apache Maven Resolver",Examples,


In [5]:
#write to file to save your dataframe as a csv
kernels.to_csv(r'data/kernels.csv')
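By default to_csv also writes the index as an unnamed first column, which shows up as "Unnamed: 0" when the file is read back. The sketch below, using an in-memory buffer and an illustrative two-row frame, shows two common ways to avoid that: index=False when writing, or index_col=0 when reading.

```python
import io

import pandas as pd

kernels_small = pd.DataFrame(
    {
        "Name": ["D-lang", "Micronaut"],
        "Notes": [None, "Compatible with BeakerX"],
    }
)

with_index = kernels_small.to_csv()                # index written as first column
without_index = kernels_small.to_csv(index=False)  # index left out

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

# Alternatively, tell read_csv that the first column is the index.
reread_index_col = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reread_with.columns))
print(list(reread_without.columns))
print(list(reread_index_col.columns))
```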

In [19]:
!ls data

AI_Tools.csv             data.csv                 list-skus.json
BankChurners.csv         gov_domains.csv          list-skus2.json
BankChurners_clean.csv   kernels.csv              melb_data.csv
categorical.pkl          list-skus.csv            rdu-weather-history.json


In [24]:
# We can view the column names with .columns
kernels.columns

Index(['Name', 'Jupyter/IPython Version', 'Language(s) Version',
       '3rd party dependencies', 'Example Notebooks', 'Notes'],
      dtype='object')

In [25]:
# View specific columns (pd.DataFrame works without a separate "from pandas import DataFrame")
pd.DataFrame(kernels, columns=['Name', 'Language(s) Version', 'Notes'])

In [22]:
#We can view/retrieve individual columns
kernels['Name']

0                     D-lang
1                  Micronaut
2                Agda kernel
3      Dyalog Jupyter Kernel
4            Coarray-Fortran
               ...          
147                  IQSharp
148         .Net Interactive
149           mariadb_kernel
150                   ISetlX
151                 Ganymede
Name: Name, Length: 152, dtype: object

In [23]:
#We can retrieve a specific row through indexing
kernels.loc[10]

Name                             IJulia
Jupyter/IPython Version             NaN
Language(s) Version        julia >= 0.3
3rd party dependencies              NaN
Example Notebooks                   NaN
Notes                               NaN
Name: 10, dtype: object

In [24]:
#Can also view an index range
kernels[20:30]

Unnamed: 0,Name,Jupyter/IPython Version,Language(s) Version,3rd party dependencies,Example Notebooks,Notes
20,SageMath,Jupyter 4,Any,many,,
21,pari_jupyter,Jupyter 4,PARI/GP >= 2.9,,,
22,IFSharp,Jupyter 4,F#,,Features,
23,lgo,"Jupyter >= 4, JupyterLab",Go >= 1.8,ZeroMQ (4.x),Example,Docker image
24,iGalileo,"Jupyter >= 4, JupyterLab",Galileo >= 0.1.3,,,Docker image
25,gopherlab,"Jupyter 4.1, JupyterLab",Go >= 1.6,ZeroMQ (4.x),examples,"Deprecated, use gophernotes"
26,Gophernotes,"Jupyter 4, JupyterLab, nteract",Go >= 1.9,ZeroMQ 4.x.x,examples,docker image
27,IGo,,Go >= 1.4,,,"Unmaintained, use gophernotes"
28,IScala,,Scala,,,
29,almond (old name: Jupyter-scala),IPython>=3.0,Scala>=2.10,,examples,Docs


In [7]:
# Another way to select and view multiple columns: pass a list of column names
kernals2 = kernels[['Name', 'Language(s) Version']]

In [26]:
kernals2

Unnamed: 0,Name,Language(s) Version
0,D-lang,DMD
1,Micronaut,"Python>=3.7.5, Groovy>3"
2,Agda kernel,2.6.0
3,Dyalog Jupyter Kernel,APL (Dyalog)
4,Coarray-Fortran,Fortran 2008/2015
...,...,...
147,IQSharp,Q#
148,.Net Interactive,"C#, F#, Powershell"
149,mariadb_kernel,SQL
150,ISetlX,SetlX


In [27]:
# You can also use a boolean condition to filter rows
kernels[kernels['Language(s) Version']=='NodeJs 12']

Unnamed: 0,Name,Jupyter/IPython Version,Language(s) Version,3rd party dependencies,Example Notebooks,Notes
140,nelu-kernelu,Jupyter,NodeJs 12,NodeJs 12.3+,Examples,An advanced NodeJs Jupyter kernel supporting c...


In [28]:
kernels.count(axis='rows')

Name                       152
Jupyter/IPython Version    105
Language(s) Version        148
3rd party dependencies      68
Example Notebooks           73
Notes                       70
dtype: int64
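Because count() reports non-null values, the gap between a column's count and the number of rows is the number of missing entries. A minimal sketch on a three-row stand-in for the kernels table (the frame name is illustrative):

```python
import pandas as pd

kernels_small = pd.DataFrame(
    {
        "Name": ["D-lang", "Micronaut", "IJulia"],
        "Notes": [None, "Compatible with BeakerX", None],
    }
)

non_null = kernels_small.count()      # non-null values per column
missing = kernels_small.isna().sum()  # missing values per column
total_rows = len(kernels_small)

print(non_null)
print(missing)
print(total_rows)
```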

In [4]:
#read excel
train_accidents = pd.read_excel('data/train_accidents.xls')

In [6]:
train_accidents.head()

Unnamed: 0,Acident,Railroad,Month,Day,State,County,TrkType,TrkMnt,AccType,AccCause,EqpDamg,TrkDamg,Killed,Injured,RREquip,Speed,LocosDer,CarsDer
0,1,NS,1,1,KY,KENTON,Main,NS,Der,T,8485,333700,0,0,FREIGHT TRAIN,24,0,13
1,2,ATK,1,2,CA,ALAMEDA,Yard,ATK,Oth,M,1500000,0,0,0,PASSENGER TRAIN,0,0,0
2,3,BNSF,1,2,MT,DAWSON,Yard,BNSF,Der,H,103833,17615,0,0,FREIGHT TRAIN,10,0,6
3,4,BNSF,1,2,OK,TULSA,Yard,BNSF,Der,M,15000,50000,0,0,LIGHT LOCO(S),4,5,0
4,5,MNCW,1,2,CT,FAIRFIELD,Main,MNCW,Oth,E,9964,750,0,0,PASSENGER TRAIN,45,0,0
