**Fin 585R**  
**Diether**  
**Problem Set**  
**Intro to Python/Pandas**  

**Overview**

This problem set is designed to introduce you to using Python for empirical analysis. You can discuss this problem set and get coding help from other students in the class, but you must prepare your own answers. This assignment will be graded based on effort. Do your best; don’t worry if you can’t answer all the questions. We will discuss the problem set in class. You may find the [Pandas Documentation](http://pandas.pydata.org/) useful.

*Please submit your Jupyter notebook to Learning Suite before class.*


**Learning Objectives**

The goal of this homework is to give you practice with some of the core concepts I highlighted in the introduction:

+ Getting data into a dataframe and working with this core data structure. For example, try to take advantage of a dataframe's built-in functions (methods) to answer some of the questions.<br><br>

+ Printing out data using various methods.<br><br>

+ Selecting data and creating variables using if/then/else logic.<br><br>

+ Your first use of the groupby/apply pandas programming framework.<br><br>


**Data Analysis with Python/Pandas**

You can download the data for the problem set here: [Monthly Stock Return and Analyst Data](http://diether.org/prephd/02-mstk_analysts.csv). There is also a link to the data on the schedule page of *Learning Suite*. The data are monthly observations for all stocks listed in the United States during 2020. The data contain the following variables:

|Variable | Description                                              |
|---------|----------------------------------------------------------|
|permno   | stock identifier                                         |
|caldt    | calendar date                                            |
|ticker   | another stock identifier                                 |
|ret      | monthly return                                           |
|prc      | stock price                                              |   
|me       | market value of equity (in millions)                     |
|analysts | number of analysts covering the stock                    |


**Tasks and Questions**

1. Print out the first 10 observations of the data.<br><br>

2. Create a new column in the dataframe that contains the natural log of (1 + analysts).<br><br>

3. During June of 2020, what is the price of Hormel's stock? Note, the ticker symbol for Hormel is HRL.<br><br>

4. What is the average number of analysts covering Tesla this year? Note, the ticker symbol for Tesla is TSLA.<br><br>

5. Create a new column in the dataframe that is True if the number of analysts is greater than 10 and False otherwise. <br><br>

6. Harder questions: questions 6-8 are more difficult. Do your best. Hint: use the `groupby` method. Compute the average market-cap (the column named me) for observations with more than 10 analysts and for observations with 10 or fewer analysts. <br><br>

7. Compute the number of stocks in the dataframe by month.<br><br>

8. Compute the aggregate market-cap of all stocks in the dataframe by month.<br><br>

9. Create a new dataframe (call it sub) that contains all the observations of Google (ticker=GOOG) and Microsoft (ticker=MSFT).<br><br>

In [10]:
import numpy as np
import pandas as pd
import math

In [11]:
df = pd.read_csv('http://diether.org/prephd/02-mstk_analysts.csv',parse_dates=['caldt'])

In [12]:
df.head(10)

Unnamed: 0,permno,caldt,ticker,prc,ret,me,analysts
0,10026,2020-01-31,JJSF,165.84,-0.100016,3137.52696,3.0
1,10026,2020-02-28,JJSF,160.82001,-0.03027,3042.553769,2.0
2,10026,2020-03-31,JJSF,121.0,-0.244031,2285.448,2.0
3,10026,2020-04-30,JJSF,127.03,0.049835,2399.34264,2.0
4,10026,2020-05-29,JJSF,128.63,0.012595,2429.56344,2.0
5,10026,2020-06-30,JJSF,127.13,-0.007191,2401.23144,2.0
6,10026,2020-07-31,JJSF,123.13,-0.031464,2326.54135,3.0
7,10026,2020-08-31,JJSF,135.95,0.104118,2568.77525,3.0
8,10026,2020-09-30,JJSF,130.39,-0.036668,2466.32685,3.0
9,10026,2020-10-30,JJSF,135.57001,0.039727,2564.306739,2.0


# Permno comes from the CRSP database
Permnos are better identifiers than tickers because ticker symbols are recycled after companies collapse.

**When an analyst follows a stock,**
they give buy/sell recommendations and generate forecasts for future stock prices

In [13]:
def gen_log_values(row):
    log_score = row.analysts
    log_score = math.log(1 + log_score)
    return log_score
df["ln_analyst"] = df.apply(gen_log_values, axis=1)
df.head(3)

Unnamed: 0,permno,caldt,ticker,prc,ret,me,analysts,ln_analyst
0,10026,2020-01-31,JJSF,165.84,-0.100016,3137.52696,3.0,1.386294
1,10026,2020-02-28,JJSF,160.82001,-0.03027,3042.553769,2.0,1.098612
2,10026,2020-03-31,JJSF,121.0,-0.244031,2285.448,2.0,1.098612


# Standard way to compute log:
`df.apply` is slow; use `df["loganalyst"] = np.log(1 + df['analysts'])`


`df.apply` with `axis=1` loops over the rows in pure Python. Another option is to use `df.eval("log(1 + analysts)")`.
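The difference between the row-by-row and vectorized approaches can be sketched on a small made-up dataframe. The `toy` frame and its column names are hypothetical stand-ins for the real data; `np.log1p` is a NumPy function that computes log(1 + x) directly.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the `analysts` column.
toy = pd.DataFrame({"analysts": [3.0, 2.0, 0.0, 11.0]})

# Row-by-row version: calls a Python function once per row.
toy["ln_apply"] = toy.apply(lambda row: math.log(1 + row["analysts"]), axis=1)

# Vectorized versions: a single call over the whole column.
toy["ln_np"] = np.log(1 + toy["analysts"])
toy["ln_log1p"] = np.log1p(toy["analysts"])

print(toy)
```

All three columns hold the same values; the vectorized forms avoid the per-row Python function call.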

In [14]:
df[(df["ticker"] == "HRL") & (df["caldt"] > '2020-06-01') & (df["caldt"] < '2020-07-01')]

Unnamed: 0,permno,caldt,ticker,prc,ret,me,analysts,ln_analyst
21945,32870,2020-06-30,HRL,48.27,-0.011468,26015.21304,11.0,2.484907


# Can also use query syntax:
`df.query("ticker == 'HRL' and caldt == '2020-06-30'")`


`query` expressions use Python syntax.


To use an external variable, use `@varname`
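As a minimal sketch of the `@` syntax, the example below filters a made-up dataframe using local Python variables. The `toy` frame, its prices (other than the 48.27 shown in the output above), and the names `target`, `start`, and `end` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the rows are made up for illustration.
toy = pd.DataFrame({
    "ticker": ["HRL", "HRL", "TSLA"],
    "caldt": pd.to_datetime(["2020-05-29", "2020-06-30", "2020-06-30"]),
    "prc": [47.0, 48.27, 100.0],
})

# Local Python variables referenced inside the query string with @.
target = "HRL"
start = pd.Timestamp("2020-06-01")
end = pd.Timestamp("2020-06-30")

june = toy.query("ticker == @target and caldt >= @start and caldt <= @end")
print(june)
```

Keeping the ticker and dates in variables makes it easy to reuse the same query for a different stock or month.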

In [15]:
df[df["ticker"] == "TSLA"].analysts.mean()

22.916666666666668

**Query Alternative:**


`df.query("ticker == 'TSLA'")["analysts"].mean()`

Avoid chaining one pair of brackets directly after another, and avoid attribute-style column access such as `.columnName`.
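One way to follow this advice is a single `.loc` call that filters rows and selects the column together. The sketch below uses a made-up dataframe; the `count` column is a hypothetical name chosen to show how attribute access can collide with a dataframe method.

```python
import pandas as pd

# Hypothetical toy data; "count" deliberately shares its name with a DataFrame method.
toy = pd.DataFrame({
    "ticker": ["TSLA", "TSLA", "HRL"],
    "analysts": [21.0, 23.0, 11.0],
    "count": [1, 2, 3],
})

# Row filter and column selection in a single .loc call.
tsla_mean = toy.loc[toy["ticker"] == "TSLA", "analysts"].mean()
print(tsla_mean)

# Attribute access finds the DataFrame.count method, not the column.
print(callable(toy.count))

# Bracket access always returns the column.
print(toy["count"].tolist())
```

Bracket access works for any column name, including names with spaces or names that match existing methods.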

In [16]:
df["gr_10"] = df["analysts"] > 10
df

Unnamed: 0,permno,caldt,ticker,prc,ret,me,analysts,ln_analyst,gr_10
0,10026,2020-01-31,JJSF,165.84000,-0.100016,3137.526960,3.0,1.386294,False
1,10026,2020-02-28,JJSF,160.82001,-0.030270,3042.553769,2.0,1.098612,False
2,10026,2020-03-31,JJSF,121.00000,-0.244031,2285.448000,2.0,1.098612,False
3,10026,2020-04-30,JJSF,127.03000,0.049835,2399.342640,2.0,1.098612,False
4,10026,2020-05-29,JJSF,128.63000,0.012595,2429.563440,2.0,1.098612,False
...,...,...,...,...,...,...,...,...,...
43485,93436,2020-08-31,TSLA,498.32001,0.741452,464339.070198,21.0,3.091042,True
43486,93436,2020-09-30,TSLA,429.01001,-0.139087,406701.489480,23.0,3.178054,True
43487,93436,2020-10-30,TSLA,388.04001,-0.095499,367823.513519,23.0,3.178054,True
43488,93436,2020-11-30,TSLA,567.59998,0.462736,538028.588642,23.0,3.178054,True


In [19]:
df.groupby("gr_10").aggregate("mean").me

gr_10
False     1534.442337
True     35443.189887
Name: me, dtype: float64

# No need to use aggregate here:
Just use `df.groupby("gr_10")["me"].mean()`
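A small sketch of this pattern on made-up data follows; the `toy` frame and its values are hypothetical. Selecting `me` before calling `mean` averages only that column, whereas averaging every column first does extra work and, in recent pandas versions, may raise an error when a text column such as `ticker` is present.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the problem set.
toy = pd.DataFrame({
    "ticker": ["A", "B", "C", "D"],
    "me": [100.0, 300.0, 5000.0, 7000.0],
    "analysts": [2.0, 10.0, 11.0, 25.0],
})
toy["gr_10"] = toy["analysts"] > 10

# Select the column first, then aggregate only that column.
avg_me = toy.groupby("gr_10")["me"].mean()
print(avg_me)
```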

In [24]:
def gen_month(row):
    date = row.caldt
    return date.month
df["Month"] = df.apply(gen_month, axis=1)
df.groupby("Month").agg('count')["ticker"]

Month
1     3596
2     3598
3     3591
4     3585
5     3580
6     3585
7     3592
8     3605
9     3634
10    3674
11    3707
12    3743
Name: ticker, dtype: int64

This is a typical number of distinct stocks traded per month.
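The month can also be extracted without `apply` by using the `.dt` accessor on a datetime column. The sketch below assumes a made-up dataframe and counts distinct `permno` values per month with `nunique`; the lowercase column name `month` is an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data: two stocks in January, three in February.
toy = pd.DataFrame({
    "permno": [10026, 93436, 10026, 93436, 10107],
    "caldt": pd.to_datetime(
        ["2020-01-31", "2020-01-31", "2020-02-28", "2020-02-28", "2020-02-28"]
    ),
})

# Vectorized month extraction from the datetime column.
toy["month"] = toy["caldt"].dt.month

# Number of distinct stocks in each month.
n_stocks = toy.groupby("month")["permno"].nunique()
print(n_stocks)
```

Counting distinct `permno` values guards against double counting if a stock ever appears twice in the same month.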

In [22]:
df.groupby("Month").agg('sum')["me"].round(0)

Month
1     31551231.0
2     28946476.0
3     25036895.0
4     28398047.0
5     29964627.0
6     30735085.0
7     32553289.0
8     35036827.0
9     33823209.0
10    33120560.0
11    37269173.0
12    39107523.0
Name: me, dtype: float64
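Grouping by the month number works here because the data cover a single year. If the sample spanned several years, one option is to group by year-month instead. The sketch below uses made-up data and computes the stock count and aggregate market cap in one pass with named aggregation; the output column names `n_stocks` and `total_me` are illustrative.

```python
import pandas as pd

# Hypothetical toy data spanning two different years.
toy = pd.DataFrame({
    "permno": [1, 2, 1, 2, 3],
    "caldt": pd.to_datetime(
        ["2020-01-31", "2020-01-31", "2021-01-29", "2021-01-29", "2021-01-29"]
    ),
    "me": [10.0, 20.0, 11.0, 22.0, 33.0],
})

# Year-month periods keep January 2020 and January 2021 separate.
year_month = toy["caldt"].dt.to_period("M")

# Named aggregation: one output column per (input column, function) pair.
summary = toy.groupby(year_month).agg(
    n_stocks=("permno", "nunique"),
    total_me=("me", "sum"),
)
print(summary)
```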

In [25]:
sub = df[(df["ticker"] == "GOOG") | (df["ticker"] == "MSFT")].reset_index(drop = True)
sub

Unnamed: 0,permno,caldt,ticker,prc,ret,me,analysts,ln_analyst,gr_10,Month
0,10107,2020-01-31,MSFT,170.23,0.079455,1294777.0,34.0,3.555348,True,1
1,10107,2020-02-28,MSFT,162.00999,-0.045292,1232256.0,32.0,3.496508,True,2
2,10107,2020-03-31,MSFT,157.71001,-0.026541,1197019.0,32.0,3.496508,True,3
3,10107,2020-04-30,MSFT,179.21001,0.136326,1359028.0,32.0,3.496508,True,4
4,10107,2020-05-29,MSFT,183.25,0.025389,1389665.0,30.0,3.433987,True,5
5,10107,2020-06-30,MSFT,203.50999,0.110559,1540774.0,33.0,3.526361,True,6
6,10107,2020-07-31,MSFT,205.00999,0.007371,1551444.0,32.0,3.496508,True,7
7,10107,2020-08-31,MSFT,225.53,0.10258,1706733.0,32.0,3.496508,True,8
8,10107,2020-09-30,MSFT,210.33,-0.067397,1590936.0,32.0,3.496508,True,9
9,10107,2020-10-30,MSFT,202.47,-0.03737,1530774.0,32.0,3.496508,True,10


# Another interesting query:
`df.query("ticker in ['GOOG', 'MSFT']").reset_index(drop = True)`

This is more readable, and it can be faster on large dataframes.
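Another common way to express the same selection is the `isin` method. The sketch below runs both forms on a made-up dataframe and checks that they agree; the `toy` frame and the `keep` list are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; prices are made up.
toy = pd.DataFrame({
    "ticker": ["GOOG", "MSFT", "TSLA", "MSFT"],
    "prc": [1.0, 2.0, 3.0, 4.0],
})
keep = ["GOOG", "MSFT"]

# Boolean mask built with isin.
sub_isin = toy[toy["ticker"].isin(keep)].reset_index(drop=True)

# Equivalent query using a local list.
sub_query = toy.query("ticker in @keep").reset_index(drop=True)

print(sub_isin)
print(sub_isin.equals(sub_query))
```

Both forms scale to a longer list of tickers without chaining more `|` conditions.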