# The Pandas library

**From the Pandas documentation:**

**pandas** is everyone's favorite data analysis library, providing fast, flexible, and expressive data structures designed to work with *relational* or table-like data (such as a SQL table or an Excel spreadsheet). It is a fundamental high-level building block for doing practical, real world data analysis in Python.

In [157]:
# The importing convention
import pandas as pd

# Opening (Reading) Files


## Reading Excel File

In [158]:
# Forward slashes work on every platform and avoid the invalid "\s" escape
filepath = "data/stock_data_simple.xlsx"

In [159]:
df = pd.read_excel(filepath)

## Reading CSV File

In [160]:
# Forward slashes work on every platform and avoid the invalid "\s" escape
filepath = "data/stock_data_simple.csv"

In [161]:
df = pd.read_csv(filepath)
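
`read_csv` accepts file-like objects as well as paths, so its behaviour can be tried without the data file. The following is a minimal sketch, assuming a made-up three-column CSV (the `toy` name and its values are illustrative and not taken from the dataset). It also builds the path with `pathlib`, which avoids writing path separators by hand:

```python
import io
from pathlib import Path

import pandas as pd

# Portable path construction: no hand-written separators.
filepath = Path("data") / "stock_data_simple.csv"
print(filepath.name)  # stock_data_simple.csv

# Toy CSV text standing in for the real file (values are made up).
csv_text = "symbol,name,price\nAAA,Alpha Corp,10.5\nBBB,Beta Inc,20.0\n"

# read_csv takes a path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (2, 3)
```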

# View/Display data

In [162]:
df.head()

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES


In [163]:
df.tail()

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY
495,BG.GB,Bg Group,Energy,13.52,0.17,1.273885,58.304817,49949,75310,16.0,BG GROUP PLC (BG GROUP) IS A NATURAL GAS COMPA...,ENERGY,Oil&Gas-Integrated,GB0008762899,,Reading,Yes,LONDON,UNITED KINGDOM
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.14,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES
497,ANZ.AU,Aus.And Nz.Banking Gp.,Financial,31.09,0.12,0.387472,-9.287592,169869,75132,13.0,AUSTRALIA AND NEW ZEALAND BANKING GROUP LIMITE...,BANKS,Banks-Money Center,AU000000ANZ3,Bank-Money Center,Melbourne,Yes,AUSTRALIAN,AUSTRALIA
498,DEUT.IT,Deutsche Telekom (Mil),Technology,12.4,-0.17,-1.352426,-42.570255,193,75071,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,"No (XETRA, DTEX.DE)",MILAN,ITALY


# The Pandas DataFrames

<img src="img/dataframe.png"><img src="img/excel_table.png">

If the structure seems familiar, it's because DataFrames are very similar to a single Excel “Sheet”, but instead of referring to cells with references such as A1, you use the column numbers/names and row numbers.

A DataFrame consists of three parts:

1. Index
2. Column names (column index)
3. Data

In [164]:
df.index

RangeIndex(start=0, stop=499, step=1)

In [165]:
df.columns

Index(['symbol', 'name', 'sector', 'price', 'price_chg', 'price__chg',
       'vol_rate', 'avg_dly__vol_000', 'mkt_val_mil_usd', 'pe_ratio',
       'company_description', 'industry_sector', 'industry_group', 'isin',
       'major_industry', 'headquarters', 'exchange_primary_listing',
       'exchange', 'trading_country'],
      dtype='object')

In [166]:
df.values

array([['APC.DE', 'Apple (Fra)', 'Technology', ..., 'No (NASDAQ, AAPL)',
        'FRANKFURT', 'GERMANY'],
       ['APCX.DE', 'Apple (Xet)', 'Technology', ..., 'No (NASDAQ, AAPL)',
        'XETRA', 'GERMANY'],
       ['AAPL', 'Apple Inc', 'Technology', ..., 'Yes', 'NASDAQ',
        'UNITED STATES'],
       ...,
       ['FOXA', 'Twenty-First Cen Fx Cl A', 'Consumer Cyclical', ...,
        'Yes', 'NASDAQ', 'UNITED STATES'],
       ['ANZ.AU', 'Aus.And Nz.Banking Gp.', 'Financial', ..., 'Yes',
        'AUSTRALIAN', 'AUSTRALIA'],
       ['DEUT.IT', 'Deutsche Telekom (Mil)', 'Technology', ...,
        'No (XETRA, DTEX.DE)', 'MILAN', 'ITALY']], dtype=object)

## Sort Data

In [167]:
df.sort_values("mkt_val_mil_usd", ascending=False)

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.00,0.000,0.000000,-100.000000,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.50,-2.000,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.110,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.20,0.000,0.000000,-100.000000,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.370,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY
495,BG.GB,Bg Group,Energy,13.52,0.170,1.273885,58.304817,49949,75310,16.0,BG GROUP PLC (BG GROUP) IS A NATURAL GAS COMPA...,ENERGY,Oil&Gas-Integrated,GB0008762899,,Reading,Yes,LONDON,UNITED KINGDOM
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.140,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES
497,ANZ.AU,Aus.And Nz.Banking Gp.,Financial,31.09,0.120,0.387472,-9.287592,169869,75132,13.0,AUSTRALIA AND NEW ZEALAND BANKING GROUP LIMITE...,BANKS,Banks-Money Center,AU000000ANZ3,Bank-Money Center,Melbourne,Yes,AUSTRALIAN,AUSTRALIA
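
`sort_values` returns a new, sorted DataFrame and leaves the original untouched; it also accepts a list of columns. The sketch below assumes a small made-up table (the `toy` frame and the `mkt_val` column are illustrative stand-ins for the real data):

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "sector": ["Energy", "Technology", "Energy", "Technology"],
    "mkt_val": [300, 100, 200, 400],
})

# Largest market value first.
by_value = toy.sort_values("mkt_val", ascending=False)

# Sector A-Z, then largest market value first within each sector.
by_sector_then_value = toy.sort_values(
    ["sector", "mkt_val"], ascending=[True, False]
)

print(by_value["symbol"].tolist())              # ['DDD', 'AAA', 'CCC', 'BBB']
print(by_sector_then_value["symbol"].tolist())  # ['AAA', 'CCC', 'DDD', 'BBB']
print(toy["symbol"].tolist())                   # original order is unchanged
```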


## Selecting Data (by Columns)

In [168]:
df['symbol']

0       APC.DE
1      APCX.DE
2         AAPL
3      XONA.DE
4          WBK
        ...   
494    DTEX.DE
495      BG.GB
496       FOXA
497     ANZ.AU
498    DEUT.IT
Name: symbol, Length: 499, dtype: object

Alternatively, access the column as an attribute:

In [169]:
df.symbol

0       APC.DE
1      APCX.DE
2         AAPL
3      XONA.DE
4          WBK
        ...   
494    DTEX.DE
495      BG.GB
496       FOXA
497     ANZ.AU
498    DEUT.IT
Name: symbol, Length: 499, dtype: object
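
Attribute access is a convenience that only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The dataset above has a column called `isin`, which clashes with the `DataFrame.isin` method. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data; the ISIN codes and values are made up.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "isin": ["US0000000001", "US0000000002"],
    "market value": [300, 100],  # name contains a space
})

# For a plain name, both spellings return the same Series.
same = toy.symbol.equals(toy["symbol"])

# toy.isin is the DataFrame method, not the column.
isin_is_method = callable(toy.isin)

# Bracket notation always reaches the column.
isin_column = toy["isin"].tolist()
spaced = toy["market value"].tolist()

print(same, isin_is_method)
print(isin_column, spaced)
```

Bracket notation is therefore the safer default in scripts; attribute access is mostly useful for quick interactive exploration.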

In [170]:
# Getting more than one column
df[['symbol', 'name','sector']]

Unnamed: 0,symbol,name,sector
0,APC.DE,Apple (Fra),Technology
1,APCX.DE,Apple (Xet),Technology
2,AAPL,Apple Inc,Technology
3,XONA.DE,Exxon Mobil (Fra),Energy
4,WBK,Westpac Banking Corp Adr,Financial
...,...,...,...
494,DTEX.DE,Deutsche Telekom (Xet),Technology
495,BG.GB,Bg Group,Energy
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical
497,ANZ.AU,Aus.And Nz.Banking Gp.,Financial


# Selecting Data

## Label vs. Location

### df.loc[row_labels, column_labels] - select data (rows and/or columns) with particular label(s).

Allowed inputs are:

* A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

* A list or array of labels, e.g. ['a', 'b', 'c'].

* A slice object with labels, e.g. 'a':'f' (both the start and the stop are included).

* A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

### df.iloc[row_positions, column_positions] - select data (rows and/or columns) at integer locations.

Allowed inputs are:

* An integer, e.g. 5.

* A list or array of integers, e.g. [4, 3, 0].

* A slice object with ints, e.g. 1:7 (the stop is excluded).

* A boolean array.
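
The difference between labels and positions is easy to miss with the default index, where the two coincide. The sketch below assumes a made-up frame whose index labels are 10, 20, 30 so that they differ from the positions 0, 1, 2:

```python
import pandas as pd

# Toy data with index labels that differ from the positions.
toy = pd.DataFrame(
    {"symbol": ["AAA", "BBB", "CCC"], "price": [10.0, 20.0, 30.0]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "symbol"]   # row labelled 20
by_position = toy.iloc[0, 0]       # first row, first column

# There is no row labelled 0, so .loc raises KeyError.
missing_label = False
try:
    toy.loc[0]
except KeyError:
    missing_label = True

print(by_label, by_position, missing_label)  # BBB AAA True
```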


### Selecting a single row by label or position

In [171]:
# a single label returns the row as a Series
df.loc[3]

symbol                                                                XONA.DE
name                                                        Exxon Mobil (Fra)
sector                                                                 Energy
price                                                                    73.2
price_chg                                                                 0.0
price__chg                                                                0.0
vol_rate                                                               -100.0
avg_dly__vol_000                                                          322
mkt_val_mil_usd                                                        435010
pe_ratio                                                                 13.0
company_description         EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...
industry_sector                                                        ENERGY
industry_group                                             Oil&G

In [172]:
# a single integer position returns the row as a Series
df.iloc[3]

symbol                                                                XONA.DE
name                                                        Exxon Mobil (Fra)
sector                                                                 Energy
price                                                                    73.2
price_chg                                                                 0.0
price__chg                                                                0.0
vol_rate                                                               -100.0
avg_dly__vol_000                                                          322
mkt_val_mil_usd                                                        435010
pe_ratio                                                                 13.0
company_description         EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...
industry_sector                                                        ENERGY
industry_group                                             Oil&G

In [173]:
# a list with one label returns a single-row DataFrame
df.loc[[3]]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY


### Select the last row of the data frame (input: integer, output: Series)

In [174]:
df.iloc[-1]

symbol                                                                DEUT.IT
name                                                   Deutsche Telekom (Mil)
sector                                                             Technology
price                                                                    12.4
price_chg                                                               -0.17
price__chg                                                          -1.352426
vol_rate                                                           -42.570255
avg_dly__vol_000                                                          193
mkt_val_mil_usd                                                         75071
pe_ratio                                                                 23.0
company_description         DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...
industry_sector                                                       TELECOM
industry_group                                        Telecom Sv

In [175]:
# select the last row of the data frame - input: list - output: data frame
df.iloc[[-1]]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
498,DEUT.IT,Deutsche Telekom (Mil),Technology,12.4,-0.17,-1.352426,-42.570255,193,75071,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,"No (XETRA, DTEX.DE)",MILAN,ITALY


### Selecting multiple rows
To extract multiple rows, we pass either a list or a slice object to the .loc[] indexer (labels) or the .iloc[] indexer (positions). Because this DataFrame uses the default RangeIndex, labels and positions coincide, so a list of integers selects the same rows with either indexer. Integer slices with .iloc[] behave as in NumPy/Python.

In [176]:
# Selecting several rows at once with a list
rows_to_select = [22, 45, 66, 234]

# using loc
df.loc[rows_to_select]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
22,NSRGY,Nestle S A Adr Spon,Consumer Staple,74.78,1.25,1.699986,-11.522471,37140,272947,21.0,"MANUFACTURES BEVERAGES, MILK PRODUCTS, ICE CRE...",FOOD/BEV,Food-Packaged,US6410694060,Food,SWITZERLAND,Yes,OTC,UNITED STATES
45,CVX,Chevron Corp,Energy,118.83,-0.35,-0.293673,-12.16909,681149,228530,10.0,"ENGAGED IN EXPLORATION, PRODUCTION, REFINING A...",ENERGY,Oil&Gas-Integrated,US1667641005,Oil&Gas-Integrated,"San Ramon, CA",Yes,NYSE,UNITED STATES
66,TYT.GB,Toyota Motor (Xsq),Consumer Cyclical,6285.66,53.785,0.863062,-89.918662,1097351,207913,14.0,TOYOTA MOTOR CORPORATION IS A JAPAN-BASED COMP...,AUTO,Auto Manufacturers,JP3633400001,Automobile,Aichi,"No (TOKYO, TYMO.JP)",SEAQ INTERNATION,UNITED KINGDOM
234,SCLX.DE,Schlumberger (Xet),Energy,64.1,-0.85,-1.308699,-92.407809,30,114821,18.0,SCHLUMBERGER LIMITED (SCHLUMBERGER N.V.) IS TH...,ENERGY,Oil&Gas-Field Services,AN8068571086,Oil&Gas-Equipment & Svc,,"No (NYSE, SLB)",XETRA,GERMANY


In [177]:
# we can select a column first and then pick rows from the resulting Series
df['symbol'].loc[rows_to_select]

22       NSRGY
45         CVX
66      TYT.GB
234    SCLX.DE
Name: symbol, dtype: object

In [178]:
# using iloc method
df.iloc[rows_to_select]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
22,NSRGY,Nestle S A Adr Spon,Consumer Staple,74.78,1.25,1.699986,-11.522471,37140,272947,21.0,"MANUFACTURES BEVERAGES, MILK PRODUCTS, ICE CRE...",FOOD/BEV,Food-Packaged,US6410694060,Food,SWITZERLAND,Yes,OTC,UNITED STATES
45,CVX,Chevron Corp,Energy,118.83,-0.35,-0.293673,-12.16909,681149,228530,10.0,"ENGAGED IN EXPLORATION, PRODUCTION, REFINING A...",ENERGY,Oil&Gas-Integrated,US1667641005,Oil&Gas-Integrated,"San Ramon, CA",Yes,NYSE,UNITED STATES
66,TYT.GB,Toyota Motor (Xsq),Consumer Cyclical,6285.66,53.785,0.863062,-89.918662,1097351,207913,14.0,TOYOTA MOTOR CORPORATION IS A JAPAN-BASED COMP...,AUTO,Auto Manufacturers,JP3633400001,Automobile,Aichi,"No (TOKYO, TYMO.JP)",SEAQ INTERNATION,UNITED KINGDOM
234,SCLX.DE,Schlumberger (Xet),Energy,64.1,-0.85,-1.308699,-92.407809,30,114821,18.0,SCHLUMBERGER LIMITED (SCHLUMBERGER N.V.) IS TH...,ENERGY,Oil&Gas-Field Services,AN8068571086,Oil&Gas-Equipment & Svc,,"No (NYSE, SLB)",XETRA,GERMANY


In [179]:
# select the first five rows of the dataframe using slice notation
df.iloc[0:5]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
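
Slices behave differently in the two indexers: `.iloc` excludes the stop position, as in Python, while `.loc` includes the stop label. A minimal sketch on made-up data (the `toy` and `ranged` frames are illustrative assumptions):

```python
import pandas as pd

# Toy data with string labels.
toy = pd.DataFrame({"price": [10, 20, 30, 40, 50]}, index=list("abcde"))

label_slice = toy.loc["b":"d"]   # stop label included: b, c, d
position_slice = toy.iloc[1:3]   # stop position excluded: b, c

print(label_slice.index.tolist())     # ['b', 'c', 'd']
print(position_slice.index.tolist())  # ['b', 'c']

# With the default RangeIndex the same asymmetry applies.
ranged = pd.DataFrame({"price": range(10)})
print(len(ranged.loc[0:5]), len(ranged.iloc[0:5]))  # 6 5
```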


In [186]:
# Selecting a single row and multiple columns
# select columns 0, 1 and 3 (symbol, name, price) of the row at position 1
df.iloc[1, [0, 1, 3]]

# select symbol, sector and price of the row labelled 44
# (only this last expression is displayed in the output)
df.loc[44, ['symbol', 'sector', 'price']]


symbol      RDSA
sector    Energy
price      71.74
Name: 44, dtype: object

### For getting a value (cell) explicitly:


In [187]:
df.iloc[2, 1]

'Apple Inc'

For getting fast access to a scalar (equivalent to the prior method):

In [188]:
df.iat[2, 1]

'Apple Inc'
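
`.iat` has a label-based counterpart, `.at`, and both can also assign a single cell. A short sketch on made-up data (the `toy` frame and its labels are assumptions for illustration):

```python
import pandas as pd

# Toy data with string row labels.
toy = pd.DataFrame(
    {"symbol": ["AAA", "BBB", "CCC"], "price": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

by_label = toy.at["y", "price"]   # label-based scalar access
by_position = toy.iat[1, 1]       # position-based scalar access
print(by_label, by_position)      # 20.0 20.0

# Both accessors can also set a single value.
toy.at["y", "price"] = 25.0
print(toy.loc["y", "price"])      # 25.0
```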

In [189]:
# For slicing rows explicitly:
df.iloc[1:3, :]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES


In [190]:
# The .iloc[] indexer is used to index a data frame by position.
df.iloc[2]

symbol                                                                   AAPL
name                                                                Apple Inc
sector                                                             Technology
price                                                                  554.25
price_chg                                                               -3.11
price__chg                                                          -0.557987
vol_rate                                                           -25.368715
avg_dly__vol_000                                                      6099413
mkt_val_mil_usd                                                        494697
pe_ratio                                                                 14.0
company_description         MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...
industry_sector                                                      COMPUTER
industry_group                                        Computer-H

In [191]:
# Passing a list (even with a single position) to .iloc[] returns a DataFrame instead of a Series.
df.iloc[[2]]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES


In [192]:
df.iloc[rows_to_select]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
22,NSRGY,Nestle S A Adr Spon,Consumer Staple,74.78,1.25,1.699986,-11.522471,37140,272947,21.0,"MANUFACTURES BEVERAGES, MILK PRODUCTS, ICE CRE...",FOOD/BEV,Food-Packaged,US6410694060,Food,SWITZERLAND,Yes,OTC,UNITED STATES
45,CVX,Chevron Corp,Energy,118.83,-0.35,-0.293673,-12.16909,681149,228530,10.0,"ENGAGED IN EXPLORATION, PRODUCTION, REFINING A...",ENERGY,Oil&Gas-Integrated,US1667641005,Oil&Gas-Integrated,"San Ramon, CA",Yes,NYSE,UNITED STATES
66,TYT.GB,Toyota Motor (Xsq),Consumer Cyclical,6285.66,53.785,0.863062,-89.918662,1097351,207913,14.0,TOYOTA MOTOR CORPORATION IS A JAPAN-BASED COMP...,AUTO,Auto Manufacturers,JP3633400001,Automobile,Aichi,"No (TOKYO, TYMO.JP)",SEAQ INTERNATION,UNITED KINGDOM
234,SCLX.DE,Schlumberger (Xet),Energy,64.1,-0.85,-1.308699,-92.407809,30,114821,18.0,SCHLUMBERGER LIMITED (SCHLUMBERGER N.V.) IS TH...,ENERGY,Oil&Gas-Field Services,AN8068571086,Oil&Gas-Equipment & Svc,,"No (NYSE, SLB)",XETRA,GERMANY


In [193]:
# Getting a single value
df.loc[1,'company_description']

"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND MARKETS MOBILE COMMUNICATION AND MEDIA DEVICES, PERSONAL COMPUTERS, AND PORTABLE DIGITAL MUSIC PLAYERS, AND SELLS A VARIETY OF RELATED SOFTWARE, SERVICES, PERIPHERALS, NETWORKING SOLUTIONS, AND THIRD-PARTY DIGITAL CONTENT AND APPLICATIONS. THE COMPANY'S PRODUCTS AND SERVICES INCLUDE IPHONE, IPAD, MAC, IPOD, APPLE TV, A PORTFOLIO OF CONSUMER AND PROFESSIONAL SOFTWARE APPLICATIONS, THE IOS AND OS X OPERATING SYSTEMS, ICLOUD, AND A VARIETY OF ACCESSORY, SERVICE AND SUPPORT OFFERINGS. IN MARCH 2013, THE COMPANY ACQUIRED A SILICON VALLEY STARTUP, WIFISLAM, WHICH MAKES MAPPING APPLICATIONS FOR SMART PHONES. EFFECTIVE JULY 19, 2013, APPLE ACQUIRED LOCATIONARY INC. EFFECTIVE JULY 20, 2013, APPLE ACQUIRED HOPSTOP.COM INC. EFFECTIVE AUGUST 28, 2013, APPLE ACQUIRED ALGOTRIM AB, A MALMO-BASED DEVELOPER OF PREPACKAGED SOFTWARE. IN NOVEMBER 2013, APPLE BOUGHT PRIMESE"

See more at Selection by Label.

* df.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

    * An integer e.g. 5.

    * A list or array of integers [4, 3, 0].

    * A slice object with ints 1:7.

    * A boolean array (any NA values will be treated as False).

    * A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
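
The rules in the list above can be checked on a small frame. This is a sketch on made-up data (the `toy` frame is an assumption) covering the out-of-bounds behaviour, a boolean list and a callable:

```python
import pandas as pd

# Toy data with four rows at positions 0..3.
toy = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]})

# A single out-of-bounds position raises IndexError.
raised = False
try:
    toy.iloc[10]
except IndexError:
    raised = True

# An out-of-bounds slice is simply clipped.
clipped = toy.iloc[2:10]

# A boolean list of the same length as the axis.
masked = toy.iloc[[True, False, False, True]]

# A callable receives the frame and returns a valid indexer.
every_other = toy.iloc[lambda d: list(range(0, len(d), 2))]

print(raised, len(clipped))          # True 2
print(masked.index.tolist())         # [0, 3]
print(every_other.index.tolist())    # [0, 2]
```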

As the examples above show, the *iloc* indexer selects purely by position.

## Describe data: a quick statistical summary

In [196]:
df.describe()

Unnamed: 0,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio
count,499.0,499.0,499.0,486.0,499.0,499.0,474.0
mean,3937.696,7.238984,0.12202,23.230114,1222698.0,137938.56513,30.776371
std,59280.55,112.630061,1.219567,426.729572,14699970.0,69760.0524,127.496568
min,0.15,-892.0,-4.782609,-100.0,0.0,75071.0,2.0
25%,26.455,-0.07,-0.291533,-100.0,33.5,88903.5,12.0
50%,54.66,0.0,0.0,-26.303047,303.0,113659.0,16.0
75%,100.445,0.17,0.35156,17.051872,88088.5,175293.0,21.0
max,1301000.0,2000.0,9.458656,5900.0,297549100.0,497751.0,1414.0
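
The `count` row reports non-missing values only, which is why some columns above show fewer than 499. A minimal sketch on made-up data (the `toy` frame is an assumption) that makes the link between `count` and missing values explicit:

```python
import pandas as pd

# Toy data; pe_ratio has two missing values.
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "pe_ratio": [12.0, None, 18.0, None],
})

summary = toy.describe()
missing = toy["pe_ratio"].isna().sum()

print(summary.loc["count", "pe_ratio"])  # 2.0
print(summary.loc["mean", "pe_ratio"])   # 15.0
print(summary.loc["50%", "price"])       # 25.0
print(missing)                           # 2
```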


## Transpose your data

In [197]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,489,490,491,492,493,494,495,496,497,498
symbol,APC.DE,APCX.DE,AAPL,XONA.DE,WBK,XOM,XOAX.DE,KO.PE,GOOG,GG@X.DE,...,USB,DTEA.DE,DTE.NL,DTE.RO,RIO1.DE,DTEX.DE,BG.GB,FOXA,ANZ.AU,DEUT.IT
name,Apple (Fra),Apple (Xet),Apple Inc,Exxon Mobil (Fra),Westpac Banking Corp Adr,Exxon Mobil Corp,Exxon Mobil (Xet),Coca Cola (Lim),Google Inc,Google 'A' (Xet),...,U S Bancorp Inc,Dt.Telekom Spn.Adr.(Fra),Deutsche Telekom (Ams),Deutsche Telekom (Bse),Rio Tinto (Fra),Deutsche Telekom (Xet),Bg Group,Twenty-First Cen Fx Cl A,Aus.And Nz.Banking Gp.,Deutsche Telekom (Mil)
sector,Technology,Technology,Technology,Energy,Financial,Energy,Energy,Consumer Staple,Technology,Technology,...,Financial,Technology,Technology,Technology,Basic Material,Technology,Energy,Consumer Cyclical,Financial,Technology
price,410.0,408.5,554.25,73.2,27.99,98.94,72.32,90.72,1156.22,849.35,...,41.46,12.5,12.5,56.5,39.25,12.45,13.52,32.47,31.09,12.4
price_chg,0.0,-2.0,-3.11,0.0,-0.37,0.16,-0.39,0.0,7.6,4.2,...,-0.04,0.0,-0.02,1.25,0.0,-0.195,0.17,-0.14,0.12,-0.17
price__chg,0.0,-0.487211,-0.557987,0.0,-1.304654,0.161976,-0.536377,0.0,0.661663,0.496953,...,-0.096385,0.0,-0.159744,2.262443,0.0,-1.542722,1.273885,-0.429316,0.387472,-1.352426
vol_rate,-100.0,-38.011541,-25.368715,-100.0,-8.477119,-22.553492,-63.128791,,5.16121,19.008264,...,1.862741,-100.0,-90.344828,-100.0,-100.0,4.413403,58.304817,-26.565262,-9.287592,-42.570255
avg_dly__vol_000,1765,6159,6099413,322,3780,1201688,405,0,1838251,1028,...,323937,18,18,0,306,124042,49949,354171,169869,193
mkt_val_mil_usd,497751,495930,494697,435010,432467,432220,429768,400612,386278,319639,...,75718,75679,75676,75526,75417,75343,75310,75153,75132,75071
pe_ratio,14.0,14.0,14.0,13.0,14.0,13.0,13.0,47.0,27.0,31.0,...,14.0,24.0,23.0,22.0,,23.0,16.0,25.0,13.0,23.0


## Answering simple questions about a dataset

### How many companies are there by Sector?

In [198]:
df['sector'].value_counts()

Financial            116
Technology            89
Energy                75
Health Care           63
Consumer Staple       52
Capital Equipment     32
Consumer Cyclical     28
Retail                21
Basic Material        20
Transportation         3
Name: sector, dtype: int64

### What is the average Market Capitalization?

In [199]:
df['mkt_val_mil_usd'].mean()

137938.5651302605

### What is the most frequent sector?

In [200]:
df['sector'].describe()

count           499
unique           10
top       Financial
freq            116
Name: sector, dtype: object
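
The same question can also be answered directly, without reading it off the `describe()` output. A sketch on made-up data (the `toy` frame is an assumption) using `value_counts` and `mode`:

```python
import pandas as pd

# Toy data; "Financial" appears twice.
toy = pd.DataFrame(
    {"sector": ["Financial", "Energy", "Financial", "Technology"]}
)

counts = toy["sector"].value_counts()
shares = toy["sector"].value_counts(normalize=True)  # fractions instead of counts

top = counts.idxmax()                 # label with the highest count
modes = toy["sector"].mode().tolist() # all most-frequent values

print(top, counts[top], shares[top])  # Financial 2 0.5
print(modes)                          # ['Financial']
```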

### Who are the 5 largest companies?

In [201]:
df.sort_values('mkt_val_mil_usd', ascending=False)[:5]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
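
An alternative to sorting and slicing is `nlargest`, which returns the top rows for a column in one call. A sketch on made-up data (the `toy` frame and `mkt_val` column are illustrative assumptions):

```python
import pandas as pd

# Toy data; the values are made up.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "mkt_val": [300, 100, 200, 400],
})

top2 = toy.nlargest(2, "mkt_val")
via_sort = toy.sort_values("mkt_val", ascending=False).head(2)

print(top2["symbol"].tolist())      # ['DDD', 'AAA']
print(via_sort["symbol"].tolist())  # ['DDD', 'AAA']
```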


# Boolean Indexing
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not.
These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
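
The precedence rule can be seen on a small frame. This sketch uses made-up data (the `toy` frame and `mkt_val` column are assumptions) to combine conditions with parentheses and to show that the unparenthesized form fails:

```python
import pandas as pd

# Toy data; the values are made up.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "sector": ["Energy", "Technology", "Energy", "Technology"],
    "mkt_val": [300, 100, 120, 400],
})

both = toy[(toy["sector"] == "Energy") & (toy["mkt_val"] > 150)]
either = toy[(toy["sector"] == "Energy") | (toy["mkt_val"] > 350)]
not_energy = toy[~(toy["sector"] == "Energy")]

print(both["symbol"].tolist())        # ['AAA']
print(either["symbol"].tolist())      # ['AAA', 'CCC', 'DDD']
print(not_energy["symbol"].tolist())  # ['BBB', 'DDD']

# Without parentheses, & binds tighter than the comparisons, and the
# resulting chained comparison cannot be reduced to a single True/False.
failed = False
try:
    toy[toy["mkt_val"] > 150 & toy["mkt_val"] < 350]
except ValueError:
    failed = True
print(failed)  # True
```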

In [202]:
df[df.mkt_val_mil_usd > 425010]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
5,XOM,Exxon Mobil Corp,Energy,98.94,0.16,0.161976,-22.553492,1201688,432220,13.0,"ENGAGED IN THE EXPLORATION, PRODUCTION, TRANSP...",ENERGY,Oil&Gas-Integrated,US30231G1022,Oil&Gas-Integrated,"Irving, TX",Yes,NYSE,UNITED STATES
6,XOAX.DE,Exxon Mobil (Xet),Energy,72.32,-0.39,-0.536377,-63.128791,405,429768,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",XETRA,GERMANY


In [211]:
# exclude the Technology sector
df[~(df.sector == "Technology")]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
3,XONA.DE,Exxon Mobil (Fra),Energy,73.20,0.00,0.000000,-100.000000,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
5,XOM,Exxon Mobil Corp,Energy,98.94,0.16,0.161976,-22.553492,1201688,432220,13.0,"ENGAGED IN THE EXPLORATION, PRODUCTION, TRANSP...",ENERGY,Oil&Gas-Integrated,US30231G1022,Oil&Gas-Integrated,"Irving, TX",Yes,NYSE,UNITED STATES
6,XOAX.DE,Exxon Mobil (Xet),Energy,72.32,-0.39,-0.536377,-63.128791,405,429768,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",XETRA,GERMANY
7,KO.PE,Coca Cola (Lim),Consumer Staple,90.72,0.00,0.000000,,0,400612,47.0,THE COCA-COLA COMPANY IS A BEVERAGE COMPANY. T...,FOOD/BEV,Beverages-Non-Alcoholic,US1912161007,Beverage - Non-Alcoholic,,"No (NYSE, KO)",LIMA,PERU
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
489,USB,U S Bancorp Inc,Financial,41.46,-0.04,-0.096385,1.862741,323937,75718,14.0,HOLDING COMPANY FOR U.S. BANK OPERATING THROUG...,BANKS,Banks-Super Regional,US9029733048,Bank-Super Regional,"Minneapolis, MN",Yes,NYSE,UNITED STATES
493,RIO1.DE,Rio Tinto (Fra),Basic Material,39.25,0.00,0.000000,-100.000000,306,75417,,RIO TINTO PLC (RIO TINTO) IS AN INTERNATIONAL ...,MINING,Mining-Metal Ores,GB0007188757,,England,"No (LONDON, RIO.GB)",FRANKFURT,GERMANY
495,BG.GB,Bg Group,Energy,13.52,0.17,1.273885,58.304817,49949,75310,16.0,BG GROUP PLC (BG GROUP) IS A NATURAL GAS COMPA...,ENERGY,Oil&Gas-Integrated,GB0008762899,,Reading,Yes,LONDON,UNITED KINGDOM
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.14,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES


Indexing with isin

In [212]:
df.symbol

0       APC.DE
1      APCX.DE
2         AAPL
3      XONA.DE
4          WBK
        ...   
494    DTEX.DE
495      BG.GB
496       FOXA
497     ANZ.AU
498    DEUT.IT
Name: symbol, Length: 499, dtype: object

In [213]:
df.symbol.isin(["AAPL", "FB", "WBK", "DTEX.DE", "FOXA", "MSFT"])

0      False
1      False
2       True
3      False
4       True
       ...  
494     True
495    False
496     True
497    False
498    False
Name: symbol, Length: 499, dtype: bool

In [214]:
df[df.symbol.isin(["AAPL", "FB", "WBK", "DTEX.DE", "FOXA", "MSFT"])]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
11,MSFT,Microsoft Corp,Technology,36.89,0.13,0.353645,-4.880698,1468683,307956,13.0,"DEVELOPS OPERATING SYSTEMS, BUSINESS SOFTWARE,...",SOFTWARE,Computer Sftwr-Desktop,US5949181045,Computer-Software,"Redmond, WA",Yes,NASDAQ,UNITED STATES
160,FB,Facebook Inc Cl A,Technology,57.19,-0.41,-0.711805,-48.25063,3840670,145274,79.0,PROVIDES A SOCIAL NETWORKING PLATFORM ENABLING...,INTERNET,Internet-Content,US30303M1027,Internet,"Menlo Park, CA",Yes,NASDAQ,UNITED STATES
494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.14,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES
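
The same pattern can be tried on a small frame. The sketch below uses made-up toy data (the names `toy`, `wanted`, `selected` and `excluded` are illustrative and not part of the notebook) to show `isin` building a boolean mask and `~` inverting it.

```python
import pandas as pd

# Toy data standing in for the stock table (illustrative values only)
toy = pd.DataFrame({
    "symbol": ["AAPL", "XOM", "MSFT", "WBK"],
    "sector": ["Technology", "Energy", "Technology", "Financial"],
})

# "TSLA" is not in the frame; isin simply never matches it
wanted = ["AAPL", "MSFT", "TSLA"]

mask = toy["symbol"].isin(wanted)  # boolean Series, one entry per row
selected = toy[mask]               # rows whose symbol is in the list
excluded = toy[~mask]              # ~ flips the mask: rows NOT in the list

print(selected)
print(excluded)
```

One thing to watch for in this dataset: it also has a column named `isin`. Because `isin` is a DataFrame method, `df.isin` refers to the method, so the column has to be accessed as `df['isin']`.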


## Further questions

In [215]:
df.columns

Index(['symbol', 'name', 'sector', 'price', 'price_chg', 'price__chg',
       'vol_rate', 'avg_dly__vol_000', 'mkt_val_mil_usd', 'pe_ratio',
       'company_description', 'industry_sector', 'industry_group', 'isin',
       'major_industry', 'headquarters', 'exchange_primary_listing',
       'exchange', 'trading_country'],
      dtype='object')

### Give me the list of the companies in the Technology Sector

In [219]:
df['sector'] == 'Technology'

0       True
1       True
2       True
3      False
4      False
       ...  
494     True
495    False
496    False
497    False
498     True
Name: sector, Length: 499, dtype: bool

We can use a boolean Series to index a Series or a DataFrame. This is called "masking" or boolean indexing.

In [220]:
df.loc[df['sector'] == 'Technology']

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.00,0.000,0.000000,-100.000000,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.50,-2.000,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.110,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
8,GOOG,Google Inc,Technology,1156.22,7.600,0.661663,5.161210,1838251,386278,27.0,"PROVIDES ONLINE SEARCH, INTERNET CONTENT SERVI...",INTERNET,Internet-Content,US38259P5089,Internet,"Mountain View, CA",Yes,NASDAQ,UNITED STATES
9,GG@X.DE,Google 'A' (Xet),Technology,849.35,4.200,0.496953,19.008264,1028,319639,31.0,GOOGLE INC. (GOOGLE) IS A GLOBAL TECHNOLOGY CO...,INTERNET,Internet-Content,US38259P5089,,,"No (NASDAQ, GOOG)",XETRA,GERMANY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
490,DTEA.DE,Dt.Telekom Spn.Adr.(Fra),Technology,12.50,0.000,0.000000,-100.000000,18,75679,24.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,US2515661054,Telecommunication Equip,,"No (OTC, DTEGY)",FRANKFURT,GERMANY
491,DTE.NL,Deutsche Telekom (Ams),Technology,12.50,-0.020,-0.159744,-90.344828,18,75676,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,"No (XETRA, DTEX.DE)",AMSTERDAM (AEX),NETHERLANDS
492,DTE.RO,Deutsche Telekom (Bse),Technology,56.50,1.250,2.262443,-100.000000,0,75526,22.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Servce,,"No (XETRA, DTEX.DE)",BUCHAREST,ROMANIA
494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY


### Give me the list of the companies in the Technology Sector and United States

In [221]:
df.loc[(df['sector'] == 'Technology') & (df['trading_country'] == 'UNITED STATES')]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
8,GOOG,Google Inc,Technology,1156.22,7.6,0.661663,5.16121,1838251,386278,27.0,"PROVIDES ONLINE SEARCH, INTERNET CONTENT SERVI...",INTERNET,Internet-Content,US38259P5089,Internet,"Mountain View, CA",Yes,NASDAQ,UNITED STATES
11,MSFT,Microsoft Corp,Technology,36.89,0.13,0.353645,-4.880698,1468683,307956,13.0,"DEVELOPS OPERATING SYSTEMS, BUSINESS SOFTWARE,...",SOFTWARE,Computer Sftwr-Desktop,US5949181045,Computer-Software,"Redmond, WA",Yes,NASDAQ,UNITED STATES
73,IBM,Intl Business Machines,Technology,188.76,1.02,0.543304,-1.631052,909319,204965,12.0,PROVIDES IT CONSULTING SERVICES AND COMPUTER H...,BUSINS SVC,Computer-Tech Services,US4592001014,Computer-Services,"Armonk, NY",Yes,NYSE,UNITED STATES
87,CHL,China Mobile Ltd Adr,Technology,50.17,-0.22,-0.436594,3.262383,50662,201693,10.0,HONG KONG-BASED PROVIDER OF DIGITAL WIRELESS V...,TELECOM,Telecom Svcs- Foreign,US16941M1099,Telecommunication Servce,HONG KONG,Yes,NYSE,UNITED STATES
98,VOD,Vodafone Group Plc Adr,Technology,38.83,-0.1,-0.256871,-34.011403,297016,191470,25.0,U.K.-BASED PROVIDER OF DIGITAL WIRELESS VOICE ...,TELECOM,Telecom Svcs- Foreign,US92857W2098,Telecommunication Servce,UNITED KINGDOM,Yes,NASDAQ,UNITED STATES
119,T,A T & T Inc,Technology,33.96,0.17,0.503107,-23.260207,713753,178901,14.0,"PROVIDES LOCAL EXCHANGE, LONG DISTANCE, NETWOR...",TELECOM,Telecom Svcs-Integrated,US00206R1023,Telecommunication Servce,"Dallas, TX",Yes,NYSE,UNITED STATES
130,ORCL,Oracle Corp,Technology,38.29,-0.12,-0.312418,-28.108149,718701,172205,14.0,"DEVELOPS DATABASE, MIDDLEWARE AND BUSINESS APP...",SOFTWARE,Computer Sftwr-Database,US68389X1054,Computer-Software,"Redwood City, CA",Yes,NYSE,UNITED STATES
160,FB,Facebook Inc Cl A,Technology,57.19,-0.41,-0.711805,-48.25063,3840670,145274,79.0,PROVIDES A SOCIAL NETWORKING PLATFORM ENABLING...,INTERNET,Internet-Content,US30303M1027,Internet,"Menlo Park, CA",Yes,NASDAQ,UNITED STATES
167,CMCSA,Comcast Corp Cl A,Technology,53.54,-0.53,-0.98021,2.945779,532114,140029,23.0,"PROVIDES VIDEO, INTERNET, PHONE, NETWORK, AND ...",TELECOM,Telecom Svcs-Cable/Satl,US20030N1019,Telecommunication Servce,"Philadelphia, PA",Yes,NASDAQ,UNITED STATES
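
As a minimal sketch of how such conditions combine, the toy frame below (four hand-picked rows, not the full file) builds two masks and joins them with `&` (and) and `|` (or). Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==`. The `query` call is shown as an optional, equivalent spelling.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "symbol": ["AAPL", "APC.DE", "XOM", "XONA.DE"],
    "sector": ["Technology", "Technology", "Energy", "Energy"],
    "trading_country": ["UNITED STATES", "GERMANY", "UNITED STATES", "GERMANY"],
})

# Naming the masks keeps the combined condition readable
is_tech = toy["sector"] == "Technology"
is_us = toy["trading_country"] == "UNITED STATES"

tech_and_us = toy.loc[is_tech & is_us]  # both conditions must hold
tech_or_us = toy.loc[is_tech | is_us]   # at least one condition holds

# The same "and" filter written as a query string
same_via_query = toy.query(
    "sector == 'Technology' and trading_country == 'UNITED STATES'"
)

print(tech_and_us)
print(tech_or_us)
```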


**Grouping operations**: Split-Apply-Combine operation.

By **grouping** or **group by** operations we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure


<img src="img/split_apply_combine.png">

<b>Step 1 (Split):</b> The <i>groupby</i> operation <b><i>splits</i></b> the dataframe into a group of dataframes based on some criteria. Note that the grouped object is <i>not</i> a dataframe. It is a GroupBy object. It has a dictionary-like structure and is also iterable.

<b>Step 2 (Apply):</b> Once we have a grouped object we can <b><i>apply</i></b> functions or run an analysis on each group, a set of groups, or the entire group.

<b>Step 3 (Combine):</b> We can also <b><i>combine</i></b> the results of the analysis into one or more new data structures.
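
The three steps can be spelled out by hand on a toy frame. This is only a sketch with made-up values (the names `toy`, `pieces`, `means`, `combined` and `shortcut` are illustrative); in practice pandas performs all three steps in a single call, shown on the last line.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "sector": ["Energy", "Technology", "Energy", "Technology"],
    "pe_ratio": [13.0, 14.0, 11.0, 27.0],
})

grouped_toy = toy.groupby("sector")

# Step 1 (Split): iterating a GroupBy object yields (group name, sub-DataFrame) pairs
pieces = {name: group for name, group in grouped_toy}

# Step 2 (Apply): run a function on each piece independently
means = {name: group["pe_ratio"].mean() for name, group in pieces.items()}

# Step 3 (Combine): collect the per-group results into one structure
combined = pd.Series(means, name="pe_ratio")

# The idiomatic one-liner that does all three steps
shortcut = grouped_toy["pe_ratio"].mean()

print(combined)
print(shortcut)
```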

Since we are only interested in the companies in the "Technology" and "Energy" sectors, let's create a new DataFrame containing only those observations.

In [224]:
subset_of_interest = df.loc[(df['sector'] == "Technology") | (df['sector'] == "Energy")]

subset_of_interest.shape

(164, 19)

Let's check how many companies from each sector ended up in this new DataFrame:

In [225]:
subset_of_interest['sector'].value_counts()

Technology    89
Energy        75
Name: sector, dtype: int64

Now that we have only the companies we are interested in, we can compare them across the variables we wanted. First, let's split our new DataFrame into groups.

In [227]:
grouped = subset_of_interest.groupby('sector')

In [228]:
grouped.groups

{'Energy': [3, 5, 6, 28, 35, 42, 43, 44, 45, 86, 97, 101, 110, 151, 152, 154, 155, 156, 157, 161, 162, 163, 164, 165, 166, 168, 169, 170, 171, 172, 173, 176, 184, 228, 229, 230, 231, 234, 319, 321, 322, 325, 348, 352, 353, 355, 362, 364, 384, 385, 386, 387, 388, 390, 396, 398, 400, 410, 418, 424, 427, 430, 433, 435, 436, 440, 441, 443, 444, 446, 451, 467, 485, 487, 495], 'Technology': [0, 1, 2, 8, 9, 10, 11, 12, 13, 24, 67, 68, 73, 75, 76, 79, 81, 85, 87, 88, 98, 99, 102, 103, 104, 105, 115, 116, 117, 119, 120, 127, 128, 130, 160, 167, 174, 179, 182, 190, 194, 195, 196, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 223, 224, 226, 227, 244, 258, 281, 283, 287, 288, 289, 290, 291, 297, 301, 310, 342, 359, 360, 455, 470, 472, 473, 478, 480, 482, 483, 486, 490, 491, 492, 494, 498]}

In [229]:
grouped.get_group('Energy').head()

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
5,XOM,Exxon Mobil Corp,Energy,98.94,0.16,0.161976,-22.553492,1201688,432220,13.0,"ENGAGED IN THE EXPLORATION, PRODUCTION, TRANSP...",ENERGY,Oil&Gas-Integrated,US30231G1022,Oil&Gas-Integrated,"Irving, TX",Yes,NYSE,UNITED STATES
6,XOAX.DE,Exxon Mobil (Xet),Energy,72.32,-0.39,-0.536377,-63.128791,405,429768,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",XETRA,GERMANY
28,CVX.BE,Chevron Cert. (Bru),Energy,87.29,0.29,0.333333,766.666667,5,256073,,,ENERGY,Oil&Gas-Integrated,BE0004589306,,,Yes,BRUSSELS,BELGIUM
35,RDSB,Royal Dutch Shell B Ads,Energy,75.42,0.35,0.466231,92.496979,68336,241329,11.0,"ENGAGED IN EXPLORATION, PRODUCTION AND REFININ...",ENERGY,Oil&Gas-Integrated,US7802591070,Oil&Gas-Integrated,NETHERLANDS,Yes,NYSE,UNITED STATES


#### pe_ratio

In [232]:
grouped['pe_ratio']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002337B4B6970>

In [233]:
grouped['pe_ratio'].mean()

sector
Energy        10.671233
Technology    19.068182
Name: pe_ratio, dtype: float64

In [234]:
grouped['pe_ratio'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Energy,73.0,10.671233,4.160178,2.0,8.0,11.0,13.0,20.0
Technology,88.0,19.068182,11.357104,6.0,13.0,16.0,23.0,79.0
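
When only a few statistics are needed, `agg` accepts a list of function names. The sketch below uses made-up toy data with one missing value to illustrate the difference between `size` (all rows in the group) and `count` (non-missing values only). This would be consistent with the output above reporting 73 Energy values while the subset holds 75 Energy rows, presumably because two `pe_ratio` values are missing.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; one Energy row has no pe_ratio
toy = pd.DataFrame({
    "sector": ["Energy", "Energy", "Energy", "Technology", "Technology"],
    "pe_ratio": [13.0, np.nan, 11.0, 14.0, 27.0],
})

# size counts rows, count skips NaN, mean also ignores NaN
summary = toy.groupby("sector")["pe_ratio"].agg(["size", "count", "mean"])

print(summary)
```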


In [235]:
grouped['pe_ratio'].describe().unstack()

       sector    
count  Energy        73.000000
       Technology    88.000000
mean   Energy        10.671233
       Technology    19.068182
std    Energy         4.160178
       Technology    11.357104
min    Energy         2.000000
       Technology     6.000000
25%    Energy         8.000000
       Technology    13.000000
50%    Energy        11.000000
       Technology    16.000000
75%    Energy        13.000000
       Technology    23.000000
max    Energy        20.000000
       Technology    79.000000
dtype: float64

#### industry_group

In [236]:
grouped['industry_group'].value_counts().unstack()

industry_group,Computer Sftwr-Database,Computer Sftwr-Desktop,Computer Sftwr-Enterprse,Computer-Hardware/Perip,Computer-Networking,Computer-Tech Services,Elec-Misc Products,Elec-Semicondctor Fablss,Elec-Semiconductor Mfg,Internet-Content,Oil&Gas-Field Services,Oil&Gas-Integrated,Oil&Gas-Intl Expl&Prod,Oil&Gas-U S Expl&Prod,Telecom Svcs- Foreign,Telecom Svcs-Cable/Satl,Telecom Svcs-Integrated,Telecom Svcs-Wireless
sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Energy,,,,,,,,,,,9.0,55.0,10.0,1.0,,,,
Technology,3.0,5.0,7.0,3.0,4.0,4.0,10.0,1.0,7.0,7.0,,,,,9.0,3.0,23.0,3.0


In [237]:
100 * grouped['industry_group'].value_counts(normalize=True).unstack()

industry_group,Computer Sftwr-Database,Computer Sftwr-Desktop,Computer Sftwr-Enterprse,Computer-Hardware/Perip,Computer-Networking,Computer-Tech Services,Elec-Misc Products,Elec-Semicondctor Fablss,Elec-Semiconductor Mfg,Internet-Content,Oil&Gas-Field Services,Oil&Gas-Integrated,Oil&Gas-Intl Expl&Prod,Oil&Gas-U S Expl&Prod,Telecom Svcs- Foreign,Telecom Svcs-Cable/Satl,Telecom Svcs-Integrated,Telecom Svcs-Wireless
sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Energy,,,,,,,,,,,12.0,73.333333,13.333333,1.333333,,,,
Technology,3.370787,5.617978,7.865169,3.370787,4.494382,4.494382,11.235955,1.123596,7.865169,7.865169,,,,,10.11236,3.370787,25.842697,3.370787
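
A possible alternative for this kind of percentage table is `pd.crosstab` with `normalize="index"`, which makes each row sum to 1. The sketch below compares both approaches on made-up toy data. One difference worth knowing: the `value_counts().unstack()` route leaves `NaN` where a combination never occurs, while `crosstab` fills those cells with 0.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "sector": ["Energy"] * 4 + ["Technology"] * 2,
    "industry_group": [
        "Oil&Gas-Integrated", "Oil&Gas-Integrated", "Oil&Gas-Integrated",
        "Oil&Gas-Field Services", "Internet-Content", "Computer-Networking",
    ],
})

# Approach used in the notebook: share of each industry group within a sector
via_value_counts = 100 * (
    toy.groupby("sector")["industry_group"]
       .value_counts(normalize=True)
       .unstack()
)

# Alternative: cross-tabulation normalized over each row (sector)
via_crosstab = 100 * pd.crosstab(
    toy["sector"], toy["industry_group"], normalize="index"
)

print(via_value_counts)
print(via_crosstab)
```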


#### trading_country

In [238]:
grouped['trading_country'].describe().unstack()

        sector    
count   Energy             75
        Technology         89
unique  Energy             15
        Technology         15
top     Energy        GERMANY
        Technology    GERMANY
freq    Energy             38
        Technology         45
dtype: object

#### mkt_val_mil_usd

In [243]:
grouped['mkt_val_mil_usd'].describe().unstack()

       sector    
count  Energy            75.000000
       Technology        89.000000
mean   Energy        133881.000000
       Technology    159120.764045
std    Energy         77034.198447
       Technology     90756.277287
min    Energy         75310.000000
       Technology     75071.000000
25%    Energy         83318.000000
       Technology    101794.000000
50%    Energy        114821.000000
       Technology    125007.000000
75%    Energy        145074.000000
       Technology    188273.000000
max    Energy        435010.000000
       Technology    497751.000000
dtype: float64

#### avg_dly__vol_000

In [244]:
grouped['avg_dly__vol_000'].describe().unstack()

       sector    
count  Energy        7.500000e+01
       Technology    8.900000e+01
mean   Energy        1.802039e+05
       Technology    4.982875e+06
std    Energy        7.297668e+05
       Technology    3.378520e+07
min    Energy        0.000000e+00
       Technology    0.000000e+00
25%    Energy        1.650000e+01
       Technology    3.200000e+01
50%    Energy        3.500000e+02
       Technology    3.960000e+02
75%    Energy        8.587950e+04
       Technology    1.453120e+05
max    Energy        5.990811e+06
       Technology    2.975491e+08
dtype: float64

### Comparing the means across all numerical variables

Although we asked for just some specific columns, to get a better picture of how these groups compare across different variables, let's create a DataFrame that contains the mean for every numeric variable in our dataset.

In [245]:
# Getting the numerical columns
import numpy as np

numeric_cols = subset_of_interest.select_dtypes(include=[np.number]).columns

In [None]:
# Creating an empty DataFrame with one row per sector in the subset
mean_comparison_df = pd.DataFrame(columns=numeric_cols, index=['Energy', 'Technology'])
mean_comparison_df

In [None]:
grouped['pe_ratio'].mean()

In [None]:
# Filling the DataFrame: each group mean is aligned on the sector index
for var in numeric_cols:
    mean_comparison_df[var] = grouped[var].mean()

In [None]:
mean_comparison_df

In [None]:
# Transposing so that variables are rows and sectors are columns
mean_comparison_df = mean_comparison_df.transpose()
mean_comparison_df

### Let's do a visualization

In [None]:
import matplotlib.pyplot as plt

mean_comparison_df.plot(kind='bar', figsize=(13,4),
                                   title="Comparison of Means");

In [None]:
# Means over the whole dataset, restricted to the numeric columns
overall_means = df[numeric_cols].mean()
normalized_mean_comparison_df = mean_comparison_df.copy()

In [None]:
# Dividing each sector mean by the overall mean (1.0 = same as the overall mean)
normalized_mean_comparison_df['Energy'] = mean_comparison_df['Energy'] / overall_means
normalized_mean_comparison_df['Technology'] = mean_comparison_df['Technology'] / overall_means

In [None]:
normalized_mean_comparison_df.plot(kind='bar', figsize=(13,4),
                                   title="Comparison of Normalized Means")
plt.legend(loc='lower left', bbox_to_anchor=(0.16, 1.0))
plt.text(x=-0.2, y = 1.2, s="Sector:", fontdict={'size':14});
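
The comparison table and its normalized version can also be built without an explicit loop. The following is a minimal sketch on made-up toy data, not the notebook's own method: `mean(numeric_only=True)` skips non-numeric columns, `.T` puts variables in rows, and `div(..., axis=0)` divides each row by the matching overall mean. The names `toy`, `comparison`, `overall` and `normalized` are illustrative.

```python
import pandas as pd

# Toy data for illustration; "name" is non-numeric on purpose
toy = pd.DataFrame({
    "sector": ["Energy", "Energy", "Technology", "Technology"],
    "name": ["a", "b", "c", "d"],
    "price": [10.0, 30.0, 50.0, 70.0],
    "pe_ratio": [8.0, 12.0, 16.0, 44.0],
})

# Mean of every numeric column per sector, then variables as rows
comparison = toy.groupby("sector").mean(numeric_only=True).T

# Overall means of the same numeric columns
overall = toy[comparison.index].mean()

# Divide each row (variable) by its overall mean; 1.0 means "same as overall"
normalized = comparison.div(overall, axis=0)

print(comparison)
print(normalized)
```

Values above 1.0 in `normalized` indicate a sector mean higher than the overall mean for that variable, and values below 1.0 a lower one, which puts variables with very different scales on one comparable axis.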

