# The Pandas library

**From the Pandas documentation:**

**pandas** is everyone's favorite data analysis library providing fast, flexible, and expressive data structures designed to work with *relational* or table-like data (such as a SQL table or an Excel spreadsheet). It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

In [37]:
# The importing convention
import pandas as pd

# Opening (Reading) Files


## Reading Excel File

In [42]:
filepath = "data/stock_data_simple.xlsx"

In [43]:
df = pd.read_excel(filepath)

## Reading CSV File

In [44]:
filepath = "data/stock_data_simple.csv"

In [45]:
df = pd.read_csv(filepath)
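A backslash inside an ordinary Python string literal can be read as an escape sequence, so paths written Windows-style are fragile. The sketch below shows one way around this with `pathlib`. The sample frame and the file name `stock_sample.csv` are made up for illustration and are not the tutorial's real data file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample standing in for the real stock file.
sample = pd.DataFrame({"symbol": ["AAPL", "XOM"], "price": [554.25, 98.94]})

with tempfile.TemporaryDirectory() as tmp:
    # Path joins the components with the separator of the current OS.
    csv_path = Path(tmp) / "data" / "stock_sample.csv"
    csv_path.parent.mkdir(parents=True)
    # index=False keeps the row index out of the file, so reloading
    # does not produce an extra "Unnamed: 0" column.
    sample.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(loaded)
```

Both `pd.read_csv` and `pd.read_excel` accept a `Path` object as well as a plain string.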

# View/Display data

In [46]:
df.head()

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES


In [47]:
df.tail()

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY
495,BG.GB,Bg Group,Energy,13.52,0.17,1.273885,58.304817,49949,75310,16.0,BG GROUP PLC (BG GROUP) IS A NATURAL GAS COMPA...,ENERGY,Oil&Gas-Integrated,GB0008762899,,Reading,Yes,LONDON,UNITED KINGDOM
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.14,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES
497,ANZ.AU,Aus.And Nz.Banking Gp.,Financial,31.09,0.12,0.387472,-9.287592,169869,75132,13.0,AUSTRALIA AND NEW ZEALAND BANKING GROUP LIMITE...,BANKS,Banks-Money Center,AU000000ANZ3,Bank-Money Center,Melbourne,Yes,AUSTRALIAN,AUSTRALIA
498,DEUT.IT,Deutsche Telekom (Mil),Technology,12.4,-0.17,-1.352426,-42.570255,193,75071,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,"No (XETRA, DTEX.DE)",MILAN,ITALY


# The Pandas DataFrames

<img src="img/dataframe.png"><img src="img/excel_table.png">

If the structure seems familiar, it's because a DataFrame is very similar to a single Excel “Sheet”, but instead of referring to cells with references such as A1, you use column names (or positions) and row labels (or positions).

A DataFrame consists of three parts:

1. Index
2. Column Names (Column Index)
3. Data

In [156]:
df.index

RangeIndex(start=0, stop=499, step=1)

In [49]:
df.columns

Index(['symbol', 'name', 'sector', 'price', 'price_chg', 'price__chg',
       'vol_rate', 'avg_dly__vol_000', 'mkt_val_mil_usd', 'pe_ratio',
       'company_description', 'industry_sector', 'industry_group', 'isin',
       'major_industry', 'headquarters', 'exchange_primary_listing',
       'exchange', 'trading_country'],
      dtype='object')

In [50]:
df.values

array([['APC.DE', 'Apple (Fra)', 'Technology', ..., 'No (NASDAQ, AAPL)',
        'FRANKFURT', 'GERMANY'],
       ['APCX.DE', 'Apple (Xet)', 'Technology', ..., 'No (NASDAQ, AAPL)',
        'XETRA', 'GERMANY'],
       ['AAPL', 'Apple Inc', 'Technology', ..., 'Yes', 'NASDAQ',
        'UNITED STATES'],
       ...,
       ['FOXA', 'Twenty-First Cen Fx Cl A', 'Consumer Cyclical', ...,
        'Yes', 'NASDAQ', 'UNITED STATES'],
       ['ANZ.AU', 'Aus.And Nz.Banking Gp.', 'Financial', ..., 'Yes',
        'AUSTRALIAN', 'AUSTRALIA'],
       ['DEUT.IT', 'Deutsche Telekom (Mil)', 'Technology', ...,
        'No (XETRA, DTEX.DE)', 'MILAN', 'ITALY']], dtype=object)

## Sort Data

In [51]:
df.sort_values("mkt_val_mil_usd", ascending=False)

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,APC.DE,Apple (Fra),Technology,410.00,0.000,0.000000,-100.000000,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,APCX.DE,Apple (Xet),Technology,408.50,-2.000,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,AAPL,Apple Inc,Technology,554.25,-3.110,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,XONA.DE,Exxon Mobil (Fra),Energy,73.20,0.000,0.000000,-100.000000,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.370,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY
495,BG.GB,Bg Group,Energy,13.52,0.170,1.273885,58.304817,49949,75310,16.0,BG GROUP PLC (BG GROUP) IS A NATURAL GAS COMPA...,ENERGY,Oil&Gas-Integrated,GB0008762899,,Reading,Yes,LONDON,UNITED KINGDOM
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.140,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES
497,ANZ.AU,Aus.And Nz.Banking Gp.,Financial,31.09,0.120,0.387472,-9.287592,169869,75132,13.0,AUSTRALIA AND NEW ZEALAND BANKING GROUP LIMITE...,BANKS,Banks-Money Center,AU000000ANZ3,Bank-Money Center,Melbourne,Yes,AUSTRALIAN,AUSTRALIA


## Selecting Data (by Columns)

In [108]:
df['symbol']

0       APC.DE
1      APCX.DE
2         AAPL
3      XONA.DE
4          WBK
        ...   
494    DTEX.DE
495      BG.GB
496       FOXA
497     ANZ.AU
498    DEUT.IT
Name: symbol, Length: 499, dtype: object

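Alternatively, access the column as an attribute (this only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method):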

In [109]:
df.symbol

0       APC.DE
1      APCX.DE
2         AAPL
3      XONA.DE
4          WBK
        ...   
494    DTEX.DE
495      BG.GB
496       FOXA
497     ANZ.AU
498    DEUT.IT
Name: symbol, Length: 499, dtype: object

In [110]:
# Getting more than one column
df[['symbol', 'name','sector']]

Unnamed: 0,symbol,name,sector
0,APC.DE,Apple (Fra),Technology
1,APCX.DE,Apple (Xet),Technology
2,AAPL,Apple Inc,Technology
3,XONA.DE,Exxon Mobil (Fra),Energy
4,WBK,Westpac Banking Corp Adr,Financial
...,...,...,...
494,DTEX.DE,Deutsche Telekom (Xet),Technology
495,BG.GB,Bg Group,Energy
496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical
497,ANZ.AU,Aus.And Nz.Banking Gp.,Financial


# Selecting Data

## Label vs. Location

### df.loc[row_labels, column_labels] - select data (rows and/or columns) with particular label(s).

Allowed inputs are:

* A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

* A list or array of labels, e.g. ['a', 'b', 'c'].

* A slice object with labels, e.g. 'a':'f' (both the start and the stop are included).

* A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

### df.iloc[row_positions, column_positions] - select data (rows and/or columns) at integer locations.

Allowed inputs are:

* An integer, e.g. 5.

* A list or array of integers, e.g. [4, 3, 0].

* A slice object with ints, e.g. 1:7.

* A boolean array.
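The difference between labels and positions is easiest to see when the index labels are not simply 0, 1, 2, and so on. The following is a minimal sketch on a hypothetical three-row frame (the name `prices` and the index values 10, 20, 30 are illustrative assumptions, not part of the stock dataset).

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
prices = pd.DataFrame(
    {"symbol": ["AAPL", "XOM", "MSFT"], "price": [554.25, 98.94, 36.89]},
    index=[10, 20, 30],
)

by_label = prices.loc[20, "symbol"]   # row labelled 20
by_position = prices.iloc[0, 0]       # first row, first column

label_slice = prices.loc[10:20]       # label slices include the stop label
position_slice = prices.iloc[0:1]     # position slices exclude the stop

# .loc never falls back to positions: there is no row labelled 0 here.
try:
    prices.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(by_label, by_position, len(label_slice), len(position_slice), missing_label)
```

With the default RangeIndex used in this notebook, labels and positions coincide, which is why `.loc` and `.iloc` often appear interchangeable in the cells that follow.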


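### Selecting a single row
Because `df` uses the default RangeIndex, the label 3 and the position 3 refer to the same row, so `.loc[3]` and `.iloc[3]` return the same Series.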

In [136]:
# select by label; returns a Series
df.loc[3]

symbol                                                                XONA.DE
name                                                        Exxon Mobil (Fra)
sector                                                                 Energy
price                                                                    73.2
price_chg                                                                 0.0
price__chg                                                                0.0
vol_rate                                                               -100.0
avg_dly__vol_000                                                          322
mkt_val_mil_usd                                                        435010
pe_ratio                                                                 13.0
company_description         EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...
industry_sector                                                        ENERGY
industry_group                                             Oil&G

In [137]:
# select by position; returns a Series
df.iloc[3]

symbol                                                                XONA.DE
name                                                        Exxon Mobil (Fra)
sector                                                                 Energy
price                                                                    73.2
price_chg                                                                 0.0
price__chg                                                                0.0
vol_rate                                                               -100.0
avg_dly__vol_000                                                          322
mkt_val_mil_usd                                                        435010
pe_ratio                                                                 13.0
company_description         EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...
industry_sector                                                        ENERGY
industry_group                                             Oil&G

In [135]:
# passing a list returns a single-row DataFrame
df.loc[[3]]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY


### select the last row of the data frame - input: integer - output: series

In [138]:
df.iloc[-1]

symbol                                                                DEUT.IT
name                                                   Deutsche Telekom (Mil)
sector                                                             Technology
price                                                                    12.4
price_chg                                                               -0.17
price__chg                                                          -1.352426
vol_rate                                                           -42.570255
avg_dly__vol_000                                                          193
mkt_val_mil_usd                                                         75071
pe_ratio                                                                 23.0
company_description         DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...
industry_sector                                                       TELECOM
industry_group                                        Telecom Sv

In [139]:
# select the last row of the data frame - input: list - output: data frame
df.iloc[[-1]]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
498,DEUT.IT,Deutsche Telekom (Mil),Technology,12.4,-0.17,-1.352426,-42.570255,193,75071,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,"No (XETRA, DTEX.DE)",MILAN,ITALY


### Selecting multiple rows
To extract multiple rows, we pass either a list or a slice object to the .loc[] indexer (labels) or the .iloc[] indexer (positions). Integer slices passed to .iloc[] behave as they do in NumPy/Python.

In [153]:
# Selecting several rows at once with a list
rows_to_select = [22, 45, 66, 234]

# using loc
df.loc[rows_to_select]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
22,NSRGY,Nestle S A Adr Spon,Consumer Staple,74.78,1.25,1.699986,-11.522471,37140,272947,21.0,"MANUFACTURES BEVERAGES, MILK PRODUCTS, ICE CRE...",FOOD/BEV,Food-Packaged,US6410694060,Food,SWITZERLAND,Yes,OTC,UNITED STATES
45,CVX,Chevron Corp,Energy,118.83,-0.35,-0.293673,-12.16909,681149,228530,10.0,"ENGAGED IN EXPLORATION, PRODUCTION, REFINING A...",ENERGY,Oil&Gas-Integrated,US1667641005,Oil&Gas-Integrated,"San Ramon, CA",Yes,NYSE,UNITED STATES
66,TYT.GB,Toyota Motor (Xsq),Consumer Cyclical,6285.66,53.785,0.863062,-89.918662,1097351,207913,14.0,TOYOTA MOTOR CORPORATION IS A JAPAN-BASED COMP...,AUTO,Auto Manufacturers,JP3633400001,Automobile,Aichi,"No (TOKYO, TYMO.JP)",SEAQ INTERNATION,UNITED KINGDOM
234,SCLX.DE,Schlumberger (Xet),Energy,64.1,-0.85,-1.308699,-92.407809,30,114821,18.0,SCHLUMBERGER LIMITED (SCHLUMBERGER N.V.) IS TH...,ENERGY,Oil&Gas-Field Services,AN8068571086,Oil&Gas-Equipment & Svc,,"No (NYSE, SLB)",XETRA,GERMANY


In [154]:
# we can make the same selection on a single column
df['symbol'].loc[rows_to_select]

22       NSRGY
45         CVX
66      TYT.GB
234    SCLX.DE
Name: symbol, dtype: object

In [155]:
# using the iloc indexer (same rows here, because labels equal positions)
df.iloc[rows_to_select]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
22,NSRGY,Nestle S A Adr Spon,Consumer Staple,74.78,1.25,1.699986,-11.522471,37140,272947,21.0,"MANUFACTURES BEVERAGES, MILK PRODUCTS, ICE CRE...",FOOD/BEV,Food-Packaged,US6410694060,Food,SWITZERLAND,Yes,OTC,UNITED STATES
45,CVX,Chevron Corp,Energy,118.83,-0.35,-0.293673,-12.16909,681149,228530,10.0,"ENGAGED IN EXPLORATION, PRODUCTION, REFINING A...",ENERGY,Oil&Gas-Integrated,US1667641005,Oil&Gas-Integrated,"San Ramon, CA",Yes,NYSE,UNITED STATES
66,TYT.GB,Toyota Motor (Xsq),Consumer Cyclical,6285.66,53.785,0.863062,-89.918662,1097351,207913,14.0,TOYOTA MOTOR CORPORATION IS A JAPAN-BASED COMP...,AUTO,Auto Manufacturers,JP3633400001,Automobile,Aichi,"No (TOKYO, TYMO.JP)",SEAQ INTERNATION,UNITED KINGDOM
234,SCLX.DE,Schlumberger (Xet),Energy,64.1,-0.85,-1.308699,-92.407809,30,114821,18.0,SCHLUMBERGER LIMITED (SCHLUMBERGER N.V.) IS TH...,ENERGY,Oil&Gas-Field Services,AN8068571086,Oil&Gas-Equipment & Svc,,"No (NYSE, SLB)",XETRA,GERMANY


In [None]:
# select the first five rows of the dataframe using slice notation
df.iloc[0:5]

In [None]:
# Selecting a single row and multiple columns
# select the symbol, name, and price of the row at position 1
df.iloc[1, [0, 1, 3]]

# select the same values by label
df.loc[1, ['symbol', 'name', 'price']]


### For getting a value (cell) explicitly:


In [140]:
df.iloc[2, 1]

'Apple Inc'

For getting fast access to a scalar (equivalent to the prior method):

In [143]:
df.iat[2, 1]

'Apple Inc'
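As a small sketch of the scalar accessors, the frame below copies three rows of `symbol` and `name` from the output above (the variable name `small` is an assumption). It checks that `.iat` and its label-based counterpart `.at` return the same cell as `.iloc` and `.loc`.

```python
import pandas as pd

# Three rows copied from the head of the stock data shown earlier.
small = pd.DataFrame(
    {
        "symbol": ["APC.DE", "APCX.DE", "AAPL"],
        "name": ["Apple (Fra)", "Apple (Xet)", "Apple Inc"],
    }
)

by_iloc = small.iloc[2, 1]     # general positional indexer
by_iat = small.iat[2, 1]       # scalar-only positional accessor
by_loc = small.loc[2, "name"]  # general label indexer
by_at = small.at[2, "name"]    # scalar-only label accessor

print(by_iloc, by_iat, by_loc, by_at)
```

`.at` and `.iat` only accept a single row and a single column, which is what allows them to skip the extra work the general indexers do.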

In [None]:
# For slicing rows explicitly:
df.iloc[1:3, :]

In [105]:
# The .iloc[] indexer is used to index a data frame by position.
df.iloc[2]

symbol                                                                   AAPL
name                                                                Apple Inc
sector                                                             Technology
price                                                                  554.25
price_chg                                                               -3.11
price__chg                                                          -0.557987
vol_rate                                                           -25.368715
avg_dly__vol_000                                                      6099413
mkt_val_mil_usd                                                        494697
pe_ratio                                                                 14.0
company_description         MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...
industry_sector                                                      COMPUTER
industry_group                                        Computer-H

In [106]:
# To extract multiple rows by position, we pass either a list or a slice object to the .iloc[] indexer.
df.iloc[[2]]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES


In [107]:
df.iloc[rows_to_select]

Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
22,NSRGY,Nestle S A Adr Spon,Consumer Staple,74.78,1.25,1.699986,-11.522471,37140,272947,21.0,"MANUFACTURES BEVERAGES, MILK PRODUCTS, ICE CRE...",FOOD/BEV,Food-Packaged,US6410694060,Food,SWITZERLAND,Yes,OTC,UNITED STATES
45,CVX,Chevron Corp,Energy,118.83,-0.35,-0.293673,-12.16909,681149,228530,10.0,"ENGAGED IN EXPLORATION, PRODUCTION, REFINING A...",ENERGY,Oil&Gas-Integrated,US1667641005,Oil&Gas-Integrated,"San Ramon, CA",Yes,NYSE,UNITED STATES
66,TYT.GB,Toyota Motor (Xsq),Consumer Cyclical,6285.66,53.785,0.863062,-89.918662,1097351,207913,14.0,TOYOTA MOTOR CORPORATION IS A JAPAN-BASED COMP...,AUTO,Auto Manufacturers,JP3633400001,Automobile,Aichi,"No (TOKYO, TYMO.JP)",SEAQ INTERNATION,UNITED KINGDOM
234,SCLX.DE,Schlumberger (Xet),Energy,64.1,-0.85,-1.308699,-92.407809,30,114821,18.0,SCHLUMBERGER LIMITED (SCHLUMBERGER N.V.) IS TH...,ENERGY,Oil&Gas-Field Services,AN8068571086,Oil&Gas-Equipment & Svc,,"No (NYSE, SLB)",XETRA,GERMANY


In [99]:
# Getting a single value
df.loc[1,'company_description']

"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND MARKETS MOBILE COMMUNICATION AND MEDIA DEVICES, PERSONAL COMPUTERS, AND PORTABLE DIGITAL MUSIC PLAYERS, AND SELLS A VARIETY OF RELATED SOFTWARE, SERVICES, PERIPHERALS, NETWORKING SOLUTIONS, AND THIRD-PARTY DIGITAL CONTENT AND APPLICATIONS. THE COMPANY'S PRODUCTS AND SERVICES INCLUDE IPHONE, IPAD, MAC, IPOD, APPLE TV, A PORTFOLIO OF CONSUMER AND PROFESSIONAL SOFTWARE APPLICATIONS, THE IOS AND OS X OPERATING SYSTEMS, ICLOUD, AND A VARIETY OF ACCESSORY, SERVICE AND SUPPORT OFFERINGS. IN MARCH 2013, THE COMPANY ACQUIRED A SILICON VALLEY STARTUP, WIFISLAM, WHICH MAKES MAPPING APPLICATIONS FOR SMART PHONES. EFFECTIVE JULY 19, 2013, APPLE ACQUIRED LOCATIONARY INC. EFFECTIVE JULY 20, 2013, APPLE ACQUIRED HOPSTOP.COM INC. EFFECTIVE AUGUST 28, 2013, APPLE ACQUIRED ALGOTRIM AB, A MALMO-BASED DEVELOPER OF PREPACKAGED SOFTWARE. IN NOVEMBER 2013, APPLE BOUGHT PRIMESE"

* df.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with Python/NumPy slice semantics). Allowed inputs are:

    * An integer e.g. 5.

    * A list or array of integers [4, 3, 0].

    * A slice object with ints 1:7.

    * A boolean array (any NA values will be treated as False).

    * A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
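The bounds behaviour and the less common input types listed above can be checked on a tiny frame. This is a sketch with a hypothetical three-row frame named `tiny`; the prices are taken from the first rows of the dataset.

```python
import pandas as pd

tiny = pd.DataFrame({"price": [410.0, 408.5, 554.25]})

# A slice that runs past the end is silently clipped.
clipped = tiny.iloc[1:10]

# A single out-of-bounds integer raises IndexError.
try:
    tiny.iloc[10]
    raised = False
except IndexError:
    raised = True

# A boolean list of the same length as the axis keeps the True rows.
masked = tiny.iloc[[True, False, True]]

# A callable receives the frame and returns any valid indexer.
first_and_last = tiny.iloc[lambda frame: [0, len(frame) - 1]]

print(len(clipped), raised, list(masked.index), list(first_and_last.index))
```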

It is also possible to select by position using the *iloc* method

## Describe data: quick statistical summary

In [None]:
df.describe()

## Transpose your data

In [None]:
df.T

## Answering simple questions about a dataset

### How many companies are there by Sector?

In [74]:
df['sector'].value_counts()

Financial            116
Technology            89
Energy                75
Health Care           63
Consumer Staple       52
Capital Equipment     32
Consumer Cyclical     28
Retail                21
Basic Material        20
Transportation         3
Name: sector, dtype: int64

### What is the average Market Capitalization?

In [75]:
df['mkt_val_mil_usd'].mean()

137938.5651302605

### What is the most frequent sector?

In [76]:
df['sector'].describe()

count           499
unique           10
top       Financial
freq            116
Name: sector, dtype: object

### Who are the 5 largest companies?

In [77]:
df.sort_values('mkt_val_mil_usd', ascending=False)[:5]

Unnamed: 0.1,Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES


# Boolean Indexing
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not.
These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
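The precedence rule can be seen on a toy frame. This is a sketch in which the frame `df_ab` and its values are made up, with column names `A` and `B` following the expression in the text. Without parentheses, Python builds a chained comparison that tries to convert a Series to a single boolean, which pandas rejects with a `ValueError`.

```python
import pandas as pd

df_ab = pd.DataFrame({"A": [1, 3, 5], "B": [1, 2, 4]})

# Parenthesised conditions combine element-wise.
both = df_ab[(df_ab["A"] > 2) & (df_ab["B"] < 3)]
either = df_ab[(df_ab["A"] > 2) | (df_ab["B"] < 3)]
neither = df_ab[~((df_ab["A"] > 2) | (df_ab["B"] < 3))]

# Without parentheses, & binds tighter than > and <, and the resulting
# chained comparison needs the truth value of a whole Series.
try:
    df_ab[df_ab["A"] > 2 & df_ab["B"] < 3]
    failed = False
except ValueError:
    failed = True

print(list(both.index), list(either.index), list(neither.index), failed)
```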

In [83]:
df[df.mkt_val_mil_usd > 425010]

Unnamed: 0.1,Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
0,0,APC.DE,Apple (Fra),Technology,410.0,0.0,0.0,-100.0,1765,497751,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",FRANKFURT,GERMANY
1,1,APCX.DE,Apple (Xet),Technology,408.5,-2.0,-0.487211,-38.011541,6159,495930,14.0,"APPLE INC. (APPLE) DESIGNS, MANUFACTURES AND M...",COMPUTER,Computer-Hardware/Perip,US0378331005,,,"No (NASDAQ, AAPL)",XETRA,GERMANY
2,2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
3,3,XONA.DE,Exxon Mobil (Fra),Energy,73.2,0.0,0.0,-100.0,322,435010,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",FRANKFURT,GERMANY
4,4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
5,5,XOM,Exxon Mobil Corp,Energy,98.94,0.16,0.161976,-22.553492,1201688,432220,13.0,"ENGAGED IN THE EXPLORATION, PRODUCTION, TRANSP...",ENERGY,Oil&Gas-Integrated,US30231G1022,Oil&Gas-Integrated,"Irving, TX",Yes,NYSE,UNITED STATES
6,6,XOAX.DE,Exxon Mobil (Xet),Energy,72.32,-0.39,-0.536377,-63.128791,405,429768,13.0,EXXON MOBIL CORPORATION IS A MANUFACTURER AND ...,ENERGY,Oil&Gas-Integrated,US30231G1022,,,"No (NYSE, XOM)",XETRA,GERMANY


In [None]:
# ~ negates a mask: keep the rows whose price change is NOT negative
df[~(df.price_chg < 0)]

## Indexing with isin


In [85]:
df.symbol

0       APC.DE
1      APCX.DE
2         AAPL
3      XONA.DE
4          WBK
        ...   
494    DTEX.DE
495      BG.GB
496       FOXA
497     ANZ.AU
498    DEUT.IT
Name: symbol, Length: 499, dtype: object

In [88]:
df.symbol.isin(["AAPL", "FB", "WBK", "DTEX.DE", "FOXA", "MSFT"])

0      False
1      False
2       True
3      False
4       True
       ...  
494     True
495    False
496     True
497    False
498    False
Name: symbol, Length: 499, dtype: bool

In [89]:
df[df.symbol.isin(["AAPL", "FB", "WBK", "DTEX.DE", "FOXA", "MSFT"])]

Unnamed: 0.1,Unnamed: 0,symbol,name,sector,price,price_chg,price__chg,vol_rate,avg_dly__vol_000,mkt_val_mil_usd,pe_ratio,company_description,industry_sector,industry_group,isin,major_industry,headquarters,exchange_primary_listing,exchange,trading_country
2,2,AAPL,Apple Inc,Technology,554.25,-3.11,-0.557987,-25.368715,6099413,494697,14.0,"MANUFACTURES PERSONAL COMPUTERS, MOBILE COMMUN...",COMPUTER,Computer-Hardware/Perip,US0378331005,Computer Data Storage,"Cupertino, CA",Yes,NASDAQ,UNITED STATES
4,4,WBK,Westpac Banking Corp Adr,Financial,27.99,-0.37,-1.304654,-8.477119,3780,432467,14.0,AUSTRALIAN BANK PROVIDING BANKING/RELATED FINA...,BANKS,Banks-Foreign,US9612143019,Bank-Money Center,AUSTRALIA,Yes,NYSE,UNITED STATES
11,11,MSFT,Microsoft Corp,Technology,36.89,0.13,0.353645,-4.880698,1468683,307956,13.0,"DEVELOPS OPERATING SYSTEMS, BUSINESS SOFTWARE,...",SOFTWARE,Computer Sftwr-Desktop,US5949181045,Computer-Software,"Redmond, WA",Yes,NASDAQ,UNITED STATES
160,160,FB,Facebook Inc Cl A,Technology,57.19,-0.41,-0.711805,-48.25063,3840670,145274,79.0,PROVIDES A SOCIAL NETWORKING PLATFORM ENABLING...,INTERNET,Internet-Content,US30303M1027,Internet,"Menlo Park, CA",Yes,NASDAQ,UNITED STATES
494,494,DTEX.DE,Deutsche Telekom (Xet),Technology,12.45,-0.195,-1.542722,4.413403,124042,75343,23.0,DEUTSCHE TELEKOM AG IS A GERMANY-BASED INTEGRA...,TELECOM,Telecom Svcs-Integrated,DE0005557508,Telecommunication Equip,,Yes,XETRA,GERMANY
496,496,FOXA,Twenty-First Cen Fx Cl A,Consumer Cyclical,32.47,-0.14,-0.429316,-26.565262,354171,75153,25.0,GLOBAL MEDIA AND ENTERTAINMENT COMPANY ENGAGED...,MEDIA,Media-Diversified,US90130A1016,Media - Radio/Tv,"New York, NY",Yes,NASDAQ,UNITED STATES


## Further questions

In [87]:
df.columns

### Give me the list of the companies in the Technology Sector

In [83]:
df['sector'] == 'Technology'

We can use a boolean Series to index a Series or a DataFrame. This is called "masking" or boolean indexing.

In [86]:
df.loc[df['sector'] == 'Technology']

### Give me the list of the companies in the Technology Sector and United States

In [89]:
df.loc[(df['sector'] == 'Technology') & (df['trading_country'] == 'UNITED STATES')]

**Grouping operations**: Split-Apply-Combine operation.

By **grouping** or **group by** operations we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure


<img src="img/split_apply_combine.png">

<b>Step 1 (Split):</b> The <i>groupby</i> operation <b><i>splits</i></b> the dataframe into groups based on some criteria. Note that the grouped object is <i>not</i> a dataframe. It is a GroupBy object. It has a dictionary-like structure and is also iterable.

<b>Step 2 (Apply):</b> Once we have a grouped object we can <b><i>apply</i></b> functions or run analyses on each group, a set of groups, or all groups.

<b>Step 3 (Combine):</b> We can also <b><i>combine</i></b> the results of the analysis into one or more new data structures.
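The three steps can be sketched on a toy frame before applying them to the stock data. The frame `stocks` and its four rows are hypothetical; the column names mirror `sector` and `pe_ratio` from the dataset.

```python
import pandas as pd

stocks = pd.DataFrame(
    {
        "sector": ["Technology", "Energy", "Technology", "Energy"],
        "pe_ratio": [14.0, 13.0, 13.0, 10.0],
    }
)

# Split: the result is a GroupBy object, not a DataFrame.
grouped = stocks.groupby("sector")
is_dataframe = isinstance(grouped, pd.DataFrame)

# The GroupBy object is iterable and yields (name, sub-frame) pairs.
group_sizes = {name: len(group) for name, group in grouped}

# Apply + combine: one mean per group, collected into a Series.
mean_pe = grouped["pe_ratio"].mean()

print(is_dataframe, group_sizes)
print(mean_pe)
```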

Since we are only interested in the companies in the "Technology" and "Energy" sectors, let's create a new DataFrame containing only those observations.

In [113]:
subset_of_interest = df.loc[(df['sector'] == "Technology") | (df['sector'] == "Energy")]

subset_of_interest.shape

Because `sector` is stored as plain strings (dtype `object`) rather than as a categorical, the filtered DataFrame carries no leftover categories. A quick count confirms that only the two sectors remain:

In [114]:
subset_of_interest['sector'].value_counts()

Now that we have only the companies we are interested in, we can compare the two sectors across the variables we care about. First, let's split our new DataFrame into groups.

In [117]:
grouped = subset_of_interest.groupby('sector')

In [118]:
grouped.groups

In [120]:
grouped.get_group('Energy').head()

#### P/E ratio

In [121]:
grouped['pe_ratio']

In [122]:
grouped['pe_ratio'].mean()

In [123]:
grouped['pe_ratio'].describe()

In [124]:
grouped['pe_ratio'].describe().unstack()

#### Industry group

In [125]:
grouped['industry_group'].value_counts().unstack()

In [126]:
100 * grouped['industry_group'].value_counts(normalize=True).unstack()

#### Trading country

In [127]:
grouped['trading_country'].describe().unstack()

#### Market value (USD millions)

In [129]:
grouped['mkt_val_mil_usd'].describe().unstack()

#### Average daily dollar volume (thousands)

In [130]:
grouped['avg_dly__vol_000'].describe().unstack()

### Comparing the means across all numerical variables

Although we looked at just a few specific columns, to get a better picture of how these groups compare across different variables, let's create a DataFrame that contains the mean of every numeric variable in our dataset.

In [None]:
# Getting the numerical columns
import numpy as np

numeric_cols = subset_of_interest.select_dtypes(include=[np.number]).columns

In [None]:
# Creating an empty DataFrame
mean_comparison_df = pd.DataFrame(columns=numeric_cols, index=['Energy', 'Technology'])
mean_comparison_df

In [None]:
grouped['price'].mean()

In [None]:
# Filling the DataFrame
for var in numeric_cols:
    mean_comparison_df[var] = grouped[var].mean()

In [None]:
mean_comparison_df

In [None]:
mean_comparison_df = mean_comparison_df.transpose()
mean_comparison_df
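The same table of group means can usually be produced without the explicit loop. The sketch below is one alternative on a hypothetical four-row frame, not a replacement for the steps above; the names `subset` and `mean_comparison` and all the values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical subset with one text column and two numeric columns.
subset = pd.DataFrame(
    {
        "sector": ["Technology", "Technology", "Energy", "Energy"],
        "name": ["a", "b", "c", "d"],
        "price": [10.0, 20.0, 30.0, 50.0],
        "pe_ratio": [14.0, 16.0, 10.0, 12.0],
    }
)

# numeric_only=True skips the text column; .T puts variables on the rows
# and sectors on the columns, matching the transposed table above.
mean_comparison = subset.groupby("sector").mean(numeric_only=True).T

print(mean_comparison)
```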

### Let's do a visualization

In [None]:
mean_comparison_df.plot(kind='bar', figsize=(13,4),
                                   title="Comparison of Means");

In [None]:
# Restrict to the numeric columns; the mean of a text column is undefined
overall_means = df[numeric_cols].mean()
normalized_mean_comparison_df = mean_comparison_df.copy()

In [None]:
# After the transpose, the columns are the two sectors
normalized_mean_comparison_df['Energy'] = mean_comparison_df['Energy'] / overall_means
normalized_mean_comparison_df['Technology'] = mean_comparison_df['Technology'] / overall_means

In [None]:
import matplotlib.pyplot as plt

normalized_mean_comparison_df.plot(kind='bar', figsize=(13,4),
                                   title="Comparison of Normalized Means")
plt.legend(loc='lower left', bbox_to_anchor=(0.16, 1.0))
plt.text(x=-0.2, y = 1.2, s="Sector:", fontdict={'size':14});

