In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import seaborn as sns

## Ground Cricket Chirps

In _The Song of Insects_ (1948), George W. Pierce mechanically measured the frequency (the number of wing vibrations per second) of chirps (or pulses of sound) made by a striped ground cricket at various ground temperatures. Since crickets are ectotherms (cold-blooded), the rate of their physiological processes and their overall metabolism are influenced by temperature. Consequently, there is reason to believe that temperature would have a profound effect on aspects of their behavior, such as chirp frequency.

In general, it was found that crickets did not sing at temperatures colder than 60&deg; F or warmer than 100&deg; F.

In [2]:
ground_cricket_data = {"Chirps/Second": [20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7,
                                         15.7, 15.4, 16.3, 15.0, 17.2, 16.0, 17.0,
                                         14.4],
                       "Ground Temperature": [88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7,
                                              71.6, 69.4, 83.3, 79.6, 82.6, 80.6, 83.5,
                                              76.3]}
df = pd.DataFrame(ground_cricket_data)

### Tasks

1. Find the linear regression equation for this data.
2. Chart the original data and the equation on the chart.
3. Find the equation's $R^2$ score (use the `.score` method) to determine whether the equation is a good fit for this data. (0.8 and greater is considered a strong correlation.)
4. Extrapolate data:  If the ground temperature reached 95&deg; F, then at what approximate rate would you expect the crickets to be chirping?
5. Interpolate data:  With a listening device, you discovered that on a particular morning the crickets were chirping at a rate of 18 chirps per second.  What was the approximate ground temperature that morning?

In [37]:
temp_x = df[['Ground Temperature']]
chirp_y = df[['Chirps/Second']]
lm = linear_model.LinearRegression()
lm.fit(temp_x, chirp_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [38]:
# coef_ holds the slope of the fitted line; score() returns R^2.
print('The intercept is {}'.format(lm.intercept_))
print('The slope (regression coefficient) is {}'.format(lm.coef_))
print('The R^2 score is {}'.format(lm.score(temp_x, chirp_y)))

The intercept is [ 0.45931465]
The slope (regression coefficient) is [[ 0.20299973]]
The R^2 score is 0.6922946529146998
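
The slope, the correlation coefficient, and $R^2$ are easy to mix up, so here is a minimal sketch that recomputes all three with NumPy alone from the same 15 observations. The names `temps`, `chirps`, and `pearson_r` are illustrative and not part of the original notebook. For a one-feature least-squares fit, $R^2$ equals the square of the Pearson correlation coefficient, and neither is the same quantity as the slope stored in `coef_`.

```python
import numpy as np

# Same observations as the notebook's ground_cricket_data.
chirps = np.array([20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7,
                   15.7, 15.4, 16.3, 15.0, 17.2, 16.0, 17.0, 14.4])
temps = np.array([88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7,
                  71.6, 69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3])

# Least-squares line: chirps = intercept + slope * temps.
slope, intercept = np.polyfit(temps, chirps, deg=1)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
predicted = intercept + slope * temps
ss_res = np.sum((chirps - predicted) ** 2)
ss_tot = np.sum((chirps - chirps.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation coefficient between the two variables.
pearson_r = np.corrcoef(temps, chirps)[0, 1]

print("slope:", slope)
print("intercept:", intercept)
print("R^2:", r_squared)
print("Pearson r:", pearson_r, "r squared:", pearson_r ** 2)
```

The last line prints `pearson_r ** 2` next to `r_squared` so the two can be compared directly.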


In [42]:
sns.set(style="ticks")
# Recent seaborn releases require x and y to be passed as keyword arguments.
sns.lmplot(x="Ground Temperature", y="Chirps/Second", data=df, fit_reg=True)
plt.title('Correlation between ground temperature and chirps')
plt.show()



### Extrapolate and Interpolate data

In [6]:
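# predict() expects 2-D input: one row per sample, one column per feature.
lm.predict(pd.DataFrame({"Ground Temperature": [95]}))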

array([[ 19.74428913]])

In [7]:
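# Reverse regression: fit temperature as a function of chirp rate,
# then predict the temperature for 18 chirps per second.
lmchirp = linear_model.LinearRegression()
lmchirp.fit(chirp_y, temp_x)
lmchirp.predict(pd.DataFrame({"Chirps/Second": [18]}))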

array([[ 84.2347963]])

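If the ground temperature reached 95&deg; F, the model predicts the crickets would chirp at about 19.7 chirps per second. If the crickets were chirping at 18 chirps per second, the reverse regression (temperature on chirps) estimates a ground temperature of about 84&deg; F. With an $R^2$ of about 0.69, below the 0.8 threshold, the linear fit is only moderate, so both figures are rough estimates.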
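
The temperature estimate above comes from a second regression fitted in the opposite direction (temperature on chirps). Another reading of the task is to solve the original chirps-on-temperature equation for temperature. The sketch below does that using the intercept and slope printed earlier in this section; the function names `chirps_at` and `temp_for` are illustrative, not part of the original notebook.

```python
# Coefficients copied from the fitted model's printed output.
intercept = 0.45931465
slope = 0.20299973


def chirps_at(temp_f):
    """Predicted chirps per second at a ground temperature in deg F."""
    return intercept + slope * temp_f


def temp_for(chirps_per_second):
    """Solve chirps = intercept + slope * temp for temp."""
    return (chirps_per_second - intercept) / slope


print(chirps_at(95))   # same line the fitted model uses for prediction
print(temp_for(18))    # algebraic inverse of that line
```

Solving the equation this way gives roughly 86.4&deg; F for 18 chirps per second, a couple of degrees above the reverse-regression estimate. The two methods only agree when the fit is perfect, so the gap is another sign of how loose the relationship is.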

## Brain vs. Body Weight

In the file `brain_body.txt`, the average brain and body weight for a number of mammal species are recorded. Load this data into a Pandas data frame.

### Tasks

1. Find the linear regression equation for this data for brain weight to body weight.
2. Chart the original data and the equation on the chart.
3. Find the equation's $R^2$ score (use the `.score` method) to determine whether the equation is a good fit for this data. (0.8 and greater is considered a strong correlation.)

In [8]:
brain_body = pd.read_fwf("brain_body.txt")

In [9]:
brain_x = brain_body[['Brain']]
body_y = brain_body[['Body']]
lm1 = linear_model.LinearRegression()
lm1.fit(brain_x, body_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [10]:
# coef_ holds the slope of the fitted line; score() returns R^2.
print("The intercept is {}".format(lm1.intercept_))
print("The slope (regression coefficient) is {}".format(lm1.coef_))
print("The R^2 score is {}".format(lm1.score(brain_x, body_y)))

The intercept is [ 91.00439621]
The slope (regression coefficient) is [[ 0.96649637]]
The R^2 score is 0.8726620843043331


In [41]:
sns.set(style="ticks")
sns.lmplot(x="Brain", y="Body", data=brain_body, fit_reg=True)
plt.title('Brain vs. Body Weight')
plt.tight_layout()
plt.show()



## Salary Discrimination

The file `salary.txt` contains data for 52 tenure-track professors at a small Midwestern college. This data was used in legal proceedings in the 1980s about discrimination against women in salary.

The data in the file, by column:

1. Sex. 1 for female, 0 for male.
2. Rank. 1 for assistant professor, 2 for associate professor, 3 for full professor.
3. Year. Number of years in current rank.
4. Degree. Highest degree. 1 for doctorate, 0 for master's.
5. YSdeg. Years since highest degree was earned.
6. Salary. Salary/year in dollars.

### Tasks

2. Find the selection of columns with the best $R^2$ score.
3. Report whether sex is a factor in salary. Support your argument with graph(s) if appropriate.

### Read .txt file

In [12]:
df2 = pd.read_fwf("salary.txt", header=None, 
                 names=["Sex", "Rank", "Year", "Degree", "YSdeg", "Salary"])

In [13]:
feature_cols = ['Sex', 'Rank', 'Year', 'Degree', 'YSdeg']
lm2 = linear_model.LinearRegression()

x = df2[feature_cols]
y = df2['Salary']

lm2.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Find intercept and regression coefficients

In [14]:
print("The intercept is {}".format(lm2.intercept_))
# Pair each feature name with its fitted regression coefficient.
coef_names = list(zip(feature_cols, lm2.coef_))
print("The regression coefficients are {}".format(coef_names))

The intercept is 11410.146547255616
The regression coefficients are [('Sex', 1241.7924996014231), ('Rank', 5586.1814495214376), ('Year', 482.85976782882136), ('Degree', -1331.6440634059163), ('YSdeg', -128.79057354486233)]
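
To show how these numbers combine into a prediction, the sketch below plugs the printed intercept and coefficients into the linear equation for a hypothetical professor. The profile values and the helper name `predict_salary` are made up for illustration and do not come from `salary.txt`.

```python
# Intercept and coefficients copied from the printed output.
intercept = 11410.146547255616
coefs = {
    "Sex": 1241.7924996014231,
    "Rank": 5586.1814495214376,
    "Year": 482.85976782882136,
    "Degree": -1331.6440634059163,
    "YSdeg": -128.79057354486233,
}


def predict_salary(profile):
    """Evaluate intercept + sum(coefficient * value) for one profile."""
    return intercept + sum(coefs[name] * profile[name] for name in coefs)


# Hypothetical profile: male associate professor, 5 years in rank,
# doctorate, 10 years since the degree was earned.
male = {"Sex": 0, "Rank": 2, "Year": 5, "Degree": 1, "YSdeg": 10}
# Same profile with only the Sex column changed.
female = dict(male, Sex=1)

print(predict_salary(male))
print(predict_salary(female) - predict_salary(male))  # equals the Sex coefficient
```

Because the model is linear, changing only `Sex` from 0 to 1 shifts the prediction by exactly the `Sex` coefficient. In this fit that coefficient is positive, meaning the model predicts a higher salary for women once rank, years, and degree are held fixed; whether that difference is meaningful is a separate question that the coefficient alone does not answer.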


### Find the $R^2$ score for each column

In [15]:
lm_sex_salary = linear_model.LinearRegression()
x1 = df2[['Sex']]
y1 = df2['Salary']
lm_sex_salary.fit(x1, y1)
print("Sex: ", lm_sex_salary.score(x1, y1))

lm_rank_salary = linear_model.LinearRegression()
x2 = df2[['Rank']]
y2 = df2['Salary']
lm_rank_salary.fit(x2, y2)
print("Rank: ", lm_rank_salary.score(x2, y2))

lm_year_salary = linear_model.LinearRegression()
x3 = df2[['Year']]
y3 = df2['Salary']
lm_year_salary.fit(x3, y3)
print("Year: ", lm_year_salary.score(x3, y3))

lm_degree_salary = linear_model.LinearRegression()
x4 = df2[['Degree']]
y4 = df2['Salary']
lm_degree_salary.fit(x4, y4)
print("Degree: ", lm_degree_salary.score(x4, y4))

lm_ysdeg_salary = linear_model.LinearRegression()
x5 = df2[['YSdeg']]
y5 = df2['Salary']
lm_ysdeg_salary.fit(x5, y5)
print("YSdeg: ", lm_ysdeg_salary.score(x5, y5))

Sex:  0.0638989258329
Rank:  0.752536053927
Year:  0.490937026769
Degree:  0.00486168098475
YSdeg:  0.455428134584


Rank alone has the highest $R^2$ (about 0.75), which suggests a fairly strong linear relationship between rank and salary. Degree alone has an $R^2$ near zero (about 0.005), so it explains almost none of the variation in salary, and sex alone explains only about 6%.
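
The scores above cover single columns only, while the task asks for the best selection of columns. One way to approach that, sketched below, is to fit a model on every non-empty subset of the feature columns and record each $R^2$. The helper `r2_by_subset` is hypothetical, and the demo uses synthetic stand-in data rather than the salary file so that the cell runs on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def r2_by_subset(data, feature_cols, target):
    """Return {tuple of columns: in-sample R^2} for every non-empty subset."""
    scores = {}
    y = data[target]
    for size in range(1, len(feature_cols) + 1):
        for subset in combinations(feature_cols, size):
            X = data[list(subset)]
            scores[subset] = LinearRegression().fit(X, y).score(X, y)
    return scores


# Synthetic stand-in data (NOT the salary data): the target depends on
# columns "a" and "b" plus noise, and column "c" is unrelated.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(size=40),
    "b": rng.normal(size=40),
    "c": rng.normal(size=40),
})
demo["target"] = 3 * demo["a"] - 2 * demo["b"] + rng.normal(scale=0.5, size=40)

scores = r2_by_subset(demo, ["a", "b", "c"], "target")
# List subsets from highest to lowest R^2.
for subset, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(subset, round(score, 4))
```

In this notebook the equivalent call would be `r2_by_subset(df2, feature_cols, "Salary")`. In-sample $R^2$ never decreases when a column is added, so the full set of columns always ties or wins. Comparing subsets of the same size, or checking how little a dropped column costs, is usually more informative than the single top score.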

In [43]:
# Recent seaborn releases use `height` in place of the older `size` parameter.
sns.pairplot(df2, x_vars=["Sex", "Rank", "Year", "Degree", "YSdeg"], y_vars=["Salary"],
             height=5, aspect=.8, kind="reg")
plt.tight_layout()
plt.show()



In [24]:
sns.lmplot(x="Rank", y="Salary", hue="Sex", data=df2, palette="Set1")
plt.tight_layout()
plt.show()



In [18]:
# import numpy as np
# values = np.zeros(260)
# values.shape = (52, 5)
# fake_data = pd.DataFrame(values, columns=['Sex', 'Rank', 'Year', 'Degree', 'YSdeg'])
# fake_data['Rank'] = df2["Rank"]
# lm2.score(fake_data, y)

In [19]:
# plt.scatter(x1, y1)
# plt.plot(x1, lm_sex_salary.predict(x1), color='blue')
# plt.show()

In [20]:
# plt.scatter(x2, y2)
# plt.plot(x2, lm_rank_salary.predict(x2), color='blue')
# plt.show()

In [21]:
# plt.scatter(x3, y3)
# plt.plot(x3, lm_year_salary.predict(x3), color='blue')
# plt.show()

In [22]:
# plt.scatter(x4, y4)
# plt.plot(x4, lm_degree_salary.predict(x4), color='blue')
# plt.show()

In [23]:
# plt.scatter(x5, y5)
# plt.plot(x5, lm_ysdeg_salary.predict(x5), color='blue')
# plt.show()