## Your turn!
___

Now it's your turn to come up with a model and interpret it!

1. Pick a question to answer using the Cameras dataset. Pick a variable to predict and one variable to use to predict it.
2. Fit a GLM model of the appropriate family. (Check out [Monday's challenge](https://www.kaggle.com/rtatman/regression-challenge-day-1) if you need a refresher).
3. *Optional but recommended:* Plot diagnostic plots for your model. Does it seem like your model is a good fit for your data? If you're fitting a linear or Poisson model, are the residuals normally distributed (no patterns in the first plot and the points in the second plot are all in a line)? Are there any influential outliers?
4. Check out your model using the summary() function. Does your input variable have a strong relationship to the output variable you're predicting?
5. Write a couple sentences describing what you've learned from your model. (It could just be that it's not a very good model!)
6. Plot your two variables & use "geom_smooth" and the appropriate family to fit and plot a model. Does this confirm what you learned from examining your model?
7. *Optional:* If you want to share your analysis with friends or to ask for help, you’ll need to make it public so that other people can see it.
    * Publish your kernel by hitting the big blue “publish” button. (This may take a second.)
    * Change the visibility to “public” by clicking on the blue “Make Public” text (right above the “Fork Notebook” button).
    * Tag your notebook with 5daychallenge

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# from sklearn import linear_model
# from sklearn.metrics import mean_squared_error, r2_score
# GFX
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from subprocess import check_output
print(check_output(["ls", "../input/1000-cameras-dataset"]).decode("utf8"))

In [None]:
cameras = pd.read_csv("../input/1000-cameras-dataset/camera_dataset.csv")
print(cameras.head(5).transpose())  # a bare expression is only displayed when it is the last line of a cell
print("Shape: %s rows, %s columns" % (cameras.shape[0], cameras.shape[1]))

# nas = cameras.isnull().sum(axis=0)

# Per column: number of missing values, number of exact zeros, and dtype
df = pd.concat([cameras.isnull().sum(axis=0),
                (cameras == 0).sum(axis=0),
                cameras.dtypes], axis=1)
df.columns = ["#NA", "#0s", "dtypes"]
print(df)

Not that many NAs but a suspicious # of 0s... Let's take a closer look at the NAs.

In [None]:
na_idxs = cameras["Macro focus range"].loc[cameras["Macro focus range"].isnull()].index.values[0]
print(cameras[(na_idxs-2):(na_idxs+2)])
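# Keep an untouched copy: plain assignment would only alias the frame that dropna modifies in place
cameras_orig = cameras.copy()
cameras.dropna(axis=0, how="any", inplace=True)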

Only 2 of the 1038 cameras appear to have NA fields, so we drop them for now.
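
One caveat is worth sketching here: in pandas, plain assignment only creates a second name for the same DataFrame, so an in-place `dropna` changes both names. The toy frame below (made-up values, not the Cameras dataset) contrasts an alias with a `.copy()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cameras data (hypothetical values)
toy = pd.DataFrame({"Model": ["A", "B", "C"],
                    "Macro focus range": [10.0, np.nan, 5.0]})

alias = toy            # same object under a second name
backup = toy.copy()    # independent copy of the data

toy.dropna(axis=0, how="any", inplace=True)

# The alias lost the NA row too; only the copy still has all rows
print(len(toy), len(alias), len(backup))
```

If the original rows are needed later (for example to inspect what was dropped), the copy is the one to keep.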

Now we'll plot some histograms to get a feel for the variable distributions:

In [None]:
num_features = cameras.columns[1:]
fig, axs = plt.subplots(4,3,figsize=(15,13))
axs = axs.flatten()

palette = sns.color_palette()
for i, feature in enumerate(num_features):
    sns.distplot(cameras[feature], ax=axs[i],
                 color=palette[i % len(palette)]);

Some observations:
* Most of the cameras in the dataset are fairly recent (see Release date)
* Features with a lot of 0s: Zooms (W) and (T), Dimensions
* Features with positive outliers: Zoom tele (T), Macro range, Storage, Weight, Price

Let's now check out the correlation between variables so we can pick the variables for our regression. Starting with a Pearson (clustered) correlation matrix:

In [None]:
corr = cameras[num_features].corr()
cg = sns.clustermap(corr, cmap="YlGnBu");
plt.show();

Interestingly, most of the technical parameters (e.g. resolution) don't seem very correlated with Price; the strongest positive correlations are with Weight and Dimensions.<br>
In fact, the Zoom variables are negatively correlated with Price.<br>
For completeness, let's examine the Spearman correlation matrix:

In [None]:
corr = cameras[num_features].corr("spearman")
cg = sns.clustermap(corr, cmap="RdYlGn");
plt.show();
corr_feats = ['Price', 'Weight (inc. batteries)', 'Dimensions', 'Zoom tele (T)', 'Zoom wide (W)', 'Macro focus range']
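
To make the difference between the two matrices concrete, here is a minimal numpy-only sketch on synthetic data (not the Cameras dataset). It treats Spearman correlation as Pearson correlation computed on ranks, which holds when there are no ties; the helper names are made up for this illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

def ranks(a):
    """Ranks 0..n-1 of the values in a (assumes no ties)."""
    return np.argsort(np.argsort(a))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

x = np.arange(1.0, 21.0)
y = np.exp(x / 3.0)   # monotone in x, but strongly nonlinear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print("Pearson: %.3f, Spearman: %.3f" % (r_pearson, r_spearman))
```

A perfectly monotone relationship gets a Spearman coefficient of 1 even when it is far from a straight line, which is why the two clustermaps can disagree for skewed variables.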

We can examine some pairplots to get a better feel for the interplay between variables:

In [None]:
ax = sns.pairplot(cameras[corr_feats], palette = "Spectral", hue="Price");
ax._legend.remove()
plt.show();

It's hard to see a clear relationship with price, but there also appear to be many outliers. Let's see how the relationships for a few of the features would look if we got rid of price outliers.

In [None]:
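def nol(df, feature, m=2):
    """No-outliers filter: keep rows where `feature` lies within m standard deviations of its mean (m=0 disables it)."""
    if m==0:
        return df
    x = df[feature]
    mask = abs(x - np.mean(x)) < m*np.std(x)
    return df.loc[mask]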

reg_feats = list(corr_feats[i] for i in [1,2,4,5])
m_=1.5 # No. of std deviations - cutoff
cameras_nol = nol(cameras, 'Price', m=m_)

fig, axs = plt.subplots(2,2,figsize=(15,10))
axs = axs.flatten()
i=0
for feature in reg_feats:
    df = cameras_nol
    ax = sns.regplot(x=feature, y='Price', data=df, ax=axs[i])
    left, right = min(df[feature]) - 0.3*df[feature].std(),\
        max(df[feature] + 0.3*df[feature].std())
    ax.set_xlim(left, right)
    i+=1
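
The standard-deviation cutoff used above can be sketched on a plain array. The prices below are hypothetical, and `within_m_std` is a name made up for this illustration.

```python
import numpy as np

def within_m_std(x, m=2):
    """Boolean mask of values closer than m standard deviations to the mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) < m * x.std()

# Hypothetical prices with one extreme value
prices = np.array([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0])
mask = within_m_std(prices, m=1.5)
kept = prices[mask]
print(kept)
```

Because the mean and standard deviation are themselves pulled toward extreme values, the cutoff usually has to be tuned by eye, as with `m_=1.5` above.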

Many of the explanatory variables contain 0s, which I suspect are really NAs (barring the existence of weightless cameras). Let's see how the regression plots look with those 0s removed:

In [None]:
def nozero(df, feature):
    x = df[feature]
    mask = x != 0
    return df.loc[mask]

reg_feats = list(corr_feats[i] for i in [1,2,4,5])
m_=1.5 # No. of std deviations - cutoff

fig, axs = plt.subplots(2,2,figsize=(15,10))
axs = axs.flatten()
i=0
cameras_nol = nol(cameras, 'Price', m=m_)
for feature in reg_feats:
    df = nozero(cameras_nol, feature)
    ax = sns.regplot(x=feature, y='Price', data=df, ax=axs[i], marker='.')
    left, right = min(df[feature]) - 0.3*df[feature].std(),\
        max(df[feature] + 0.3*df[feature].std())
    ax.set_xlim(left, right)
    i+=1
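
An alternative to filtering one feature at a time, sketched here on made-up data, is to recode the suspicious zeros as missing values once, so that later steps can skip them with `dropna`. The column names mirror the dataset, but the values are hypothetical, and treating every zero as "missing" is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Weight (inc. batteries)": [0.0, 320.0, 180.0, 0.0],
    "Dimensions": [95.0, 0.0, 110.0, 120.0],
    "Price": [179.0, 399.0, 139.0, 229.0],
})

suspect_cols = ["Weight (inc. batteries)", "Dimensions"]

cleaned = toy.copy()
cleaned[suspect_cols] = cleaned[suspect_cols].replace(0, np.nan)

# Rows usable for a Price vs. Weight regression
weight_rows = cleaned.dropna(subset=["Weight (inc. batteries)"])
print(cleaned.isnull().sum())
print(len(weight_rows))
```

This keeps Price untouched and leaves a row usable for any feature whose value is still present.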

It's starting to look a little better. The prices are clearly clustered. We can try to separate them into a few groups with KMeans:

In [None]:
from sklearn.cluster import KMeans
prices = np.array(cameras['Price'])
colors = ["blue", "green", "yellow", "orange"]
fig, ax = plt.subplots(figsize=(15,6))
# optional - logscale (really saving for later projs)
# ax = plt.subplot(111)
# ax.set_yscale("log")
for k in range(2,5):
    km = KMeans(n_clusters=k).fit(prices.reshape(-1,1))
    cluster = km.labels_
    for i in range(0,k):
        y=prices[cluster==i]
        plt.scatter(x=[k]*len(y)+np.random.normal(0,0.01,(len(y))), \
                    y=y, c=sns.color_palette()[i], marker='.', alpha=0.8);
ax.set(xlabel="# clusters", xticks=[2,3,4], xticklabels=[2,3,4],\
       ylabel="Price", title="KMeans price clusters")
plt.show()
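print("Look at k=4 for filtering:")
km = KMeans(n_clusters=4).fit(prices.reshape(-1,1))
# KMeans label numbers are arbitrary: renumber so that cluster 0 is the cheapest
order = np.argsort(km.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(4)
cameras['Cluster'] = rank[km.labels_]
for i in range(4):
    print("Group %s: %s" % (i+1, (cameras['Cluster'] == i).sum()))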
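
Since KMeans assigns label numbers arbitrarily, label 0 only means "cheapest" if the clusters are renumbered by their centers. Here is a minimal numpy-only sketch of that renumbering; the `centers` and `labels` arrays are hypothetical stand-ins for `km.cluster_centers_.ravel()` and `km.labels_` from a fitted model.

```python
import numpy as np

# Hypothetical stand-ins for km.cluster_centers_.ravel() and km.labels_
centers = np.array([1500.0, 150.0, 7000.0, 450.0])
labels = np.array([1, 1, 3, 0, 2, 1, 3])

# Old label numbers sorted from cheapest to priciest center
order = np.argsort(centers)

# new_label[old] = rank of the old cluster's center (0 = cheapest)
new_label = np.empty_like(order)
new_label[order] = np.arange(len(centers))

relabelled = new_label[labels]
print(relabelled)
```

With labels ordered this way, filtering on cluster 0 selects the lowest-priced group regardless of how a particular KMeans run happened to number it.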

Let's see how our regressions look when we consider only the cheapest cluster, which is also the most populated one:

In [None]:
fig, axs = plt.subplots(2,2,figsize=(15,10))
axs = axs.flatten()
i=0
k=0
for feature in reg_feats:
    df = nozero(cameras, feature)
    df = df.loc[df['Cluster']==k]
    ax = sns.regplot(x=feature, y='Price', data=df, ax=axs[i], marker='.')
    left, right = min(df[feature]) - 0.3*df[feature].std(),\
        max(df[feature] + 0.3*df[feature].std())
    ax.set_xlim(left, right)
    i+=1
print("cluster filtering looks good, but at this zoom the linear relationships look fairly flat...")

We've successfully massaged the data so the prices appear to be in one cohort, but the linear relationships with the variables still appear flat. <br>
Let's see how regressions with some of the other variables (that were previously shown to have low correlations with price) would look in this cluster.

In [None]:
other_reg_feats = ['Max resolution', 'Effective pixels', 'Storage included', 'Normal focus range']
fig, axs = plt.subplots(2,2,figsize=(15,10))
axs = axs.flatten()
i=0
k=0
for feature in other_reg_feats:
    df = nozero(cameras, feature)
    df = df.loc[df['Cluster']==k]
    ax = sns.regplot(x=feature, y='Price', data=df, ax=axs[i], marker='.')
    left, right = min(df[feature]) - 0.3*df[feature].std(),\
        max(df[feature] + 0.3*df[feature].std())
    ax.set_xlim(left, right)
    i+=1

Superficially, Max resolution seems to have a (slightly) stronger effect on price than the others.<br>
For the sake of completing the exercise, let's choose Max resolution as the explanatory variable.<br>
Let's first check whether Max resolution also fares well in the other relatively populous cluster:

In [None]:
feature = 'Max resolution'
fig, axs = plt.subplots(1,2,figsize=(15,5))
axs = axs.flatten()
i = 0
for k in [0,1]:
    df = nozero(cameras, feature)
    df = df.loc[df['Cluster']==k]
    ax = sns.regplot(x=feature, y='Price', data=df, ax=axs[i], marker='.')
    left, right = min(df[feature]) - 0.3*df[feature].std(),\
        max(df[feature] + 0.3*df[feature].std())
    ax.set_xlim(left, right)
    i += 1

If we have any luck here with a univariate linear model, it will be limited to the cheapest price cluster.<br>
Let's run a regression:

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = nozero(cameras, feature)
df = df.loc[df['Cluster'] == 0]
X = df[feature]
X = sm.tools.add_constant(X, prepend=True)
Y = df['Price']
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())
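
For a single predictor, the `const` and slope coefficients in the summary correspond to the closed-form least-squares estimates. A minimal sketch on synthetic data follows; the numbers are made up and are not taken from the Cameras dataset.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(500, 4000, size=200)            # synthetic "Max resolution"
y = 120.0 + 0.02 * x + rng.normal(0, 5, 200)    # synthetic "Price" with known coefficients

# Closed-form simple linear regression: slope = cov(x, y) / var(x)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

# Same fit via least squares on the design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(intercept, slope)
print(beta)
```

Both routes should agree up to floating-point error, and both recover values close to the coefficients used to generate the data.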

Observations:
* Very high intercept estimate, in both value and standard error.
* Rather low slope coefficient for our independent variable, especially since Max resolution is not of a completely different order of magnitude than Price.
* The Jarque-Bera test for normality of residuals also doesn't look great, but let's plot the residuals to confirm.

In [None]:
n = len(results.resid)
fig = plt.figure(figsize = (12,6))
# Residuals in row order; a good fit scatters evenly around the zero line
plt.scatter(x=np.arange(n), y=results.resid, marker='.');
l = plt.axhline(0, color="green", ls="--");

In [None]:
from scipy.stats import probplot
print("QQ plot:")
ax = plt.figure(figsize = (12,6))
probplot(results.resid, plot=plt)
plt.show()
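
The Jarque-Bera statistic mentioned above can also be computed by hand from the sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4). A minimal sketch on synthetic residuals (not the model's actual residuals) follows; the local `jarque_bera` helper is written for this illustration.

```python
import numpy as np

def jarque_bera(resid):
    """Jarque-Bera statistic from sample skewness S and kurtosis K."""
    r = np.asarray(resid, dtype=float)
    n = r.size
    d = r - r.mean()
    m2 = np.mean(d ** 2)                  # second central moment
    skew = np.mean(d ** 3) / m2 ** 1.5
    kurt = np.mean(d ** 4) / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.RandomState(0)
normal_resid = rng.normal(0, 1, 2000)        # synthetic, roughly normal
skewed_resid = rng.exponential(1.0, 2000)    # synthetic, right-skewed

jb_normal = jarque_bera(normal_resid)
jb_skewed = jarque_bera(skewed_resid)
print(jb_normal, jb_skewed)
```

Under normality the statistic is asymptotically chi-squared with 2 degrees of freedom, so small values are consistent with normal residuals, while skewed or heavy-tailed residuals push it up sharply.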

TBC... (?)