# Linear Regression

In this homework we are going to apply linear regression to the problem of predicting developer satisfaction based upon information about their careers, from a StackOverflow survey.  The data for this question is based on the [2019 StackOverflow Survey](https://insights.stackoverflow.com/survey/2019); accordingly, the subset bundled with this assignment is also released under the Open Database License (ODbL) v1.0.  For this problem, you should not use Scikit-Learn, but instead implement all the least squares solutions manually.


In [5]:
import csv
import gzip
import math
import hashlib
import numpy as np
import pandas as pd
import re

### Q2 Splitting Data

Now we prepare the converted data for regression. In this step, we:

 1. Extract the data as a numpy array
 2. Split the data into train and validation sets.  You can use the first 20% of the data (rounded down) as the validation set and keep the remaining as the training set. (Note that it is common practice to randomize the dataset; this has already been done. Don't shuffle the dataset for this assignment.)
 3. Split each set into the predicted column (the first column in the data frame) and the feature columns (the remaining columns), plus a final column corresponding to a constant 1.0 value.  Note that you should keep the ordering of the feature columns the same as they appear in the data.
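
The three steps above can be sketched on a small made-up frame. This is only an illustration under assumed names (`toy`, `n_val`, and the `*_toy` arrays are hypothetical and not part of the assignment data); it appends the constant column last and takes the first 20% of rows, rounded down, as the validation set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: first column is the target, the rest are features.
toy = pd.DataFrame({
    "target": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "f2": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Step 1: extract the data as a numpy array and append the constant 1.0 column last.
data = np.hstack([toy.to_numpy(), np.ones((len(toy), 1))])

# Step 2: the first 20% of rows (rounded down) form the validation set.
n_val = int(len(data) * 0.2)
val, train = data[:n_val], data[n_val:]

# Step 3: the first column is the target; the remaining columns are features.
y_val_toy, X_val_toy = val[:, 0], val[:, 1:]
y_train_toy, X_train_toy = train[:, 0], train[:, 1:]

print(X_val_toy.shape, X_train_toy.shape)
```

Slicing rather than sampling keeps the row order intact, which matches the instruction not to shuffle.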


In [6]:
df = pd.read_csv('all.csv', dtype=str, keep_default_na=False)

In [7]:
df["Price"] = pd.to_numeric(df["Price"])
df["PropertyType"] = pd.to_numeric(df["PropertyType"])
df["Beds"] = pd.to_numeric(df["Beds"])
df["Baths"] = pd.to_numeric(df["Baths"])
df["Sqft"] = pd.to_numeric(df["Sqft"])
df["YearBuilt"] = pd.to_numeric(df["YearBuilt"], downcast='integer')
df["WalkScore"] = pd.to_numeric(df["WalkScore"])
df["TransitScore"] = pd.to_numeric(df["TransitScore"])
df["ParkingPrice"] = pd.to_numeric(df["ParkingPrice"])
df["ParkingType"] = pd.to_numeric(df["ParkingType"])
df["Cooling"] = pd.to_numeric(df["Cooling"])
df["Laundry"] = pd.to_numeric(df["Laundry"])

#df= df.drop(['Address', 'Description'], axis=1)

In [31]:
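def clean_description(description):
    """Lowercase the text and replace handles, URLs, punctuation, and digits with spaces."""
    description = description.lower()
    # Raw string so the regex escapes (\w, \S, \d) are not treated as string escapes.
    pattern = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|([-,\"@\'?\.$%_\d\+\:])'
    return ' '.join(re.sub(pattern, ' ', description).split())

df["Clean Description"] = df['Description'].apply(clean_description)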

stopwords = {'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
                       'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
                       'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
                       'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
                       'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'an',
                       'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by',
                       'for', 'with', 'about', 'between', 'into', 'through', 'during', 'before',
                       'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
                       'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
                       'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
                       'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 'can',
                       'just', 'should', 'now', '', 'a', 's'}



In [32]:
df["Clean Description"]

0       great house in the slopes quiet dead end great...
1       from virtual to reality call us for an in pers...
2       from virtual to reality call us for an in pers...
3       from virtual to reality call us for an in pers...
4       from virtual to reality call us for an in pers...
                              ...                        
2475    two bedroom apartment available at electric pl...
2476    two bedroom apartment available at electric pl...
2477    studio layout open room plan on second floor r...
2478    this square foot single family home has bedroo...
2479    now available welcome home to this beautiful t...
Name: Clean Description, Length: 2480, dtype: object

In [33]:
import nltk

In [42]:
from collections import Counter

def getAllWords(df):
    """Count every non-stopword token across all cleaned descriptions."""
    counts = Counter()
    for description in df['Clean Description']:
        # Counter keeps first-seen order, so the resulting row order is unchanged.
        counts.update(word for word in description.split() if word not in stopwords)
    return pd.DataFrame(list(counts.items()), columns=['word', 'count'])

wordcounts = getAllWords(df).sort_values(by=['count'], ascending=False)
vocab = wordcounts[1:25]
print(wordcounts[1:25])

              word  count
51          access   5602
237           site   5585
324       features   5542
528         floors   4293
58          center   4293
250     appliances   4222
57         fitness   4187
465          rooms   4071
1970         fifth   3985
2827         grand   3967
3383      kaufmann   3948
38      pittsburgh   3651
135           room   3151
379           unit   2961
256       hardwood   2947
125         washer   2939
126          dryer   2930
298            air   2886
375           walk   2853
297     dishwasher   2805
257           high   2790
299   conditioning   2786
267      stainless   2782
268          steel   2765


In [46]:
for word in vocab["word"]: 
    #print(word)
    df[word] = ""
print(df)

       Price                                Address  PropertyType  Beds  \
0     1185.0     1 Magdalene St Pittsburgh PA 15203           1.0   1.0   
1     1575.0    10 Allegheny Ct Pittsburgh PA 15212           0.0   2.0   
2      800.0    10 Allegheny Ct Pittsburgh PA 15212           0.0   0.0   
3     1575.0    10 Allegheny Ct Pittsburgh PA 15212           0.0   2.0   
4     1395.0    10 Allegheny Ct Pittsburgh PA 15212           0.0   1.0   
...      ...                                    ...           ...   ...   
2475   780.0  957 Bockstoce Ave Pittsburgh PA 15234           0.0   1.0   
2476   950.0  957 Bockstoce Ave Pittsburgh PA 15234           0.0   2.0   
2477   645.0         97 23rd St Pittsburgh PA 15203           1.0  -1.0   
2478   995.0         97 27th St Pittsburgh PA 15203           1.0   1.0   
2479   950.0   978 Garfield Ave Pittsburgh PA 15221           0.0   1.0   

      Baths    Sqft  YearBuilt  WalkScore  TransitScore  ParkingPrice  ...  \
0       1.5  1250.0  

In [57]:
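# Fill each vocabulary column with the number of times the word occurs in the row's description.
# Note: str.count counts substring matches, so "room" also matches inside "rooms" and "bedroom".
for index, row in df.iterrows():
    description = row['Clean Description']
    for word in vocab["word"]:
        df.loc[index, word] = description.count(word)

df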

Unnamed: 0,Price,Address,PropertyType,Beds,Baths,Sqft,YearBuilt,WalkScore,TransitScore,ParkingPrice,...,hardwood,washer,dryer,air,walk,dishwasher,high,conditioning,stainless,steel
0,1185.0,1 Magdalene St Pittsburgh PA 15203,1.0,1.0,1.5,1250.0,1900,83.0,51.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
1,1575.0,10 Allegheny Ct Pittsburgh PA 15212,0.0,2.0,1.0,1020.0,-1,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
2,800.0,10 Allegheny Ct Pittsburgh PA 15212,0.0,0.0,1.0,480.0,-1,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
3,1575.0,10 Allegheny Ct Pittsburgh PA 15212,0.0,2.0,1.0,1020.0,-1,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
4,1395.0,10 Allegheny Ct Pittsburgh PA 15212,0.0,1.0,1.0,680.0,-1,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2475,780.0,957 Bockstoce Ave Pittsburgh PA 15234,0.0,1.0,1.0,-1.0,-1,59.0,54.0,-1.0,...,0,1,0,1,0,1,0,1,0,0
2476,950.0,957 Bockstoce Ave Pittsburgh PA 15234,0.0,2.0,1.0,-1.0,-1,59.0,54.0,-1.0,...,0,1,0,1,0,1,0,1,0,0
2477,645.0,97 23rd St Pittsburgh PA 15203,1.0,-1.0,-1.0,-1.0,-1,96.0,64.0,-1.0,...,1,0,0,0,2,0,0,0,0,0
2478,995.0,97 27th St Pittsburgh PA 15203,1.0,1.0,1.0,1000.0,-1,84.0,49.0,0.0,...,0,0,0,0,0,0,0,0,0,0
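
The counts above come from `str.count` on the whole description string, which matches substrings. The sketch below contrasts that with counting whole tokens, using a made-up description and vocabulary (`text`, `vocab_demo`, and both result dictionaries are hypothetical names for illustration, not part of the notebook).

```python
from collections import Counter

# Hypothetical cleaned description and vocabulary, for illustration only.
text = "two bedroom unit near stairs with rooms for storage and a fitness room"
vocab_demo = ["room", "rooms", "air"]

# Substring counting: "room" is also found inside "bedroom" and "rooms",
# and "air" is found inside "stairs".
substring_counts = {word: text.count(word) for word in vocab_demo}

# Token counting: split on whitespace first, then count exact matches only.
tokens = Counter(text.split())
token_counts = {word: tokens[word] for word in vocab_demo}

print(substring_counts)
print(token_counts)
```

Which behaviour is preferable depends on whether inflected forms should share a feature; token counting keeps each vocabulary column tied to exactly one word.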


In [73]:
def split_data(df):
    """Split the data into training and validation sets, and convert them to np.ndarray. (Steps 1 and 2 above.)

    args:
        df : pandas.DataFrame -- the parsed data, as returned by parse_stackoverflow_data()

    returns: X_train, y_train, X_val, y_val
      X_train : np.ndarray -- the remaining 80% of the data features
      y_train : np.ndarray -- the remaining 80% of the target values
      X_val : np.ndarray -- the first 20% (rounded down) of the data features
      y_val : np.ndarray -- the first 20% (rounded down) of the target values
    """
    n = len(df)
    df['final'] = 1.0
    df = df.to_numpy()
    i = int(np.floor(n*0.2))
    val = df[0:i,:]
    train = df[i:,:]
    Y_train = train[:, 0]
    X_train = train[:, 1:]
    #print(Y_train)
    Y_val = val[:, 0]
    X_val = val[:, 1:]
    #print(X_train[:,17])
    #print(Y_train)
    return(X_train, Y_train, X_val, Y_val)
X_train, Y_train, X_val, Y_val = split_data(df)


array([[ 1. ,  2. ,  1. , ...,  1. ,  0. ,  1. ],
       [ 1. ,  3. ,  1.5, ...,  0. ,  0. ,  1. ],
       [ 0. ,  1. ,  1. , ...,  0. ,  0. ,  1. ],
       ...,
       [ 1. , -1. , -1. , ...,  0. ,  0. ,  1. ],
       [ 1. ,  1. ,  1. , ...,  0. ,  0. ,  1. ],
       [ 0. ,  1. ,  1. , ...,  0. ,  0. ,  1. ]])

### Q3 Linear Regression

Now we are going to build a simple scikit-learn-like class for least squares linear regression.  Recall from lecture that the linear regression approach models the data as
$$ y^{(i)} \approx \theta^T x^{(i)} $$
and the optimal $\theta$ is given by
$$ \theta^\star = (X^T X)^{-1}X^T y $$
using the notation described in the slides and course notes.  Recall, as mentioned in class, that you should use the `np.linalg.solve()` function rather than the `np.linalg.inv()` function to compute this solution.
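
As a rough illustration of the formula above, the sketch below solves the normal equations on synthetic, noiseless data and compares the result with the explicit-inverse version. All names here (`X_demo`, `theta_true`, `theta_solve`, `theta_inv`) are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: three random features plus a constant 1.0 column.
X_demo = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
theta_true = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = X_demo @ theta_true  # noiseless, so theta should be recovered up to rounding

# Solve (X^T X) theta = X^T y directly, without forming the inverse.
theta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# The explicit inverse gives the same answer on this well-conditioned problem,
# but costs more and tends to be less accurate when X^T X is poorly conditioned.
theta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

print(np.allclose(theta_solve, theta_true))
print(np.allclose(theta_solve, theta_inv))
```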

Implement the class below, plus the `squared_error` function.

In [22]:
def squared_error(y_pred, y):
    """ Utility function to compute squared error
    args:
        y_pred : np.ndarray[num_examples] -- the predictions
        y : np.ndarray[num_examples] -- the ground truth values

    returns:
        float : _average_ squared error between y_pred and y
    """
    return(np.mean(np.square(np.subtract(y_pred,y))))

class LinearRegression():
    """ Perform linear regression and predict the output on unseen examples. 
    
    attributes: 
        theta (np.ndarray) : vector containing parameters for the features
    """

    def __init__(self, X, y):
        """ Train the linear regression model by computing the estimate of the parameters
        You should store the model parameters in self.theta

        args: 
            X (np.ndarray[num_examples, num_columns]) : matrix of training data
            y (np.ndarray[num_examples]) : vector of output variables

        return: LinearRegression -- returns itself (for convenience)
        """
  
        self.theta = np.linalg.solve(X.T @ X, X.T @ y)


    def predict(self, X): 
        """ Use the learned model to predict the output for X

        args: 
            X : np.ndarray[num_examples, num_columns] -- matrix of features for which we form a prediction

        return: 
            np.ndarray[num_examples], vector of predicted outputs
        """
        #print(X)
        #print(self.theta)
        return(X @ self.theta)
    
    
    
    

In [23]:
def evaluate_linear_regression(X_train, y_train, X_val, y_val):
    """ Evaluate the squared error of linear regression versus the simple mean-prediction baseline.
    
    Args: X_train, y_train, X_val, y_val -- output of split_data() function
    
    Return: Tuple[validation_mse, baseline_mse]:
        validation_mse: float -- squared error of predictions on validation set, when training on training set
        baseline_mse: float -- squared error on the validation set when always predicting the training-set mean
    """
    lm = LinearRegression(X_train, y_train)
    
    
    baseline_mse = squared_error(np.mean(y_train), y_val)
    
    validation_mse = squared_error(lm.predict(X_val), y_val)
    return((validation_mse, baseline_mse))

evaluate_linear_regression(X_train, Y_train, X_val, Y_val)   

(143821.39515347502, 292466.19087478047)

In [35]:
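#https://stattrek.com/regression/slope-confidence-interval.aspx?Tutorial=AP
def standard_error(X_train, y_train):
    """Return (per-column standard errors, fitted theta) using the simple-regression slope formula."""
    lm = LinearRegression(X_train, y_train)
    y = y_train
    y_hat = lm.predict(X_train)
    x = X_train
    # Note: np.mean without an axis averages over every entry of X_train, not per column.
    x_bar = np.mean(X_train)
    n = len(y_train)
    # Residual standard deviation (n - 2 degrees of freedom) over the per-column spread around x_bar.
    return np.sqrt(sum((y - y_hat)**2)/(n - 2))/np.sqrt(sum((x - x_bar)**2)), lm.theta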

array([0.02198998, 0.02209206, 0.02209672, 0.0078201 , 0.00297413,
       0.03778921, 0.03617847, 0.03849233, 0.02206963, 0.02207361,
       0.02208288, 0.02208725])
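
The formula linked above is stated for simple (one-feature) regression. For comparison, here is a minimal sketch of the textbook standard errors for multiple regression, the square roots of the diagonal of $\hat\sigma^2 (X^T X)^{-1}$ with $\hat\sigma^2$ computed on $n - p$ degrees of freedom. This is not the method used in the cell above; `ols_standard_errors` and the synthetic data are hypothetical, and the sketch assumes a float-valued `X` whose columns are linearly independent.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Hypothetical helper: coefficient standard errors for least squares."""
    n, p = X.shape
    xtx = X.T @ X
    theta = np.linalg.solve(xtx, X.T @ y)
    residuals = y - X @ theta
    # Unbiased estimate of the noise variance with n - p degrees of freedom.
    sigma2 = residuals @ residuals / (n - p)
    # Covariance of the estimate: sigma^2 (X^T X)^{-1}, obtained via solve.
    cov = sigma2 * np.linalg.solve(xtx, np.eye(p))
    return np.sqrt(np.diag(cov)), theta

# Synthetic check: two random features plus a constant column, with known noise.
rng = np.random.default_rng(0)
X_syn = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y_syn = X_syn @ np.array([1.5, -2.0, 4.0]) + rng.normal(scale=0.5, size=200)

se_syn, theta_syn = ols_standard_errors(X_syn, y_syn)
print(se_syn)
print(theta_syn)
```

With this estimate, the matching degrees of freedom for a t-test would be $n - p$ rather than $n - 2$.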

In [37]:
std_err, coefs = standard_error(X_train, Y_train)

In [43]:
cols = list(df.drop(['Price', 'final'], axis=1).columns)
cols.append('final')  # list.append mutates in place and returns None, so do not reassign
cols

['PropertyType',
 'Beds',
 'Baths',
 'Sqft',
 'YearBuilt',
 'WalkScore',
 'TransitScore',
 'ParkingPrice',
 'ParkingType',
 'Cooling',
 'Laundry',
 'final']

In [102]:
#https://online.stat.psu.edu/stat501/lesson/2/2.12
#https://stats.stackexchange.com/questions/324260/manually-calculate-the-parameters-std-error-of-lm-output-in-r
from scipy import stats
t_value = coefs/std_err
p_value = (1-stats.t.cdf(abs(t_value),df=len(Y_train)-2))*2

In [103]:
summary = pd.DataFrame()
summary['variables'] = cols
summary['coefficients'] = coefs
summary['std error'] = std_err
summary['t_value'] = t_value
summary['p_value'] = p_value
print(summary)

       variables  coefficients  std error      t_value   p_value
0   PropertyType     59.885976   0.021990  2723.330558  0.000000
1           Beds    199.515307   0.022092  9031.088123  0.000000
2          Baths    136.130106   0.022097  6160.647255  0.000000
3           Sqft      0.157870   0.007820    20.187685  0.000000
4      YearBuilt      0.010993   0.002974     3.696187  0.000225
5      WalkScore      2.379358   0.037789    62.963957  0.000000
6   TransitScore      6.384853   0.036178   176.482136  0.000000
7   ParkingPrice      0.080419   0.038492     2.089211  0.036816
8    ParkingType    -14.604286   0.022070  -661.736832  0.000000
9        Cooling    153.803979   0.022074  6967.776690  0.000000
10       Laundry     38.744494   0.022083  1754.503895  0.000000
11         final     56.575380   0.022087  2561.449559  0.000000


In [74]:
import matplotlib.pyplot as plt
#plt.plot(X_train, Y_train)

In [None]:
#https://www.statisticshowto.com/probability-and-statistics/coefficient-of-determination-r-squared/
#https://en.wikipedia.org/wiki/Simple_linear_regression
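
The links above concern the coefficient of determination. A minimal sketch of that quantity, assuming the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$, might look like the following; `r_squared` and `y_toy` are hypothetical names and the numbers are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Hypothetical helper: coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

y_toy = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r_squared(y_toy, y_toy)                       # predictions equal the targets
r2_mean = r_squared(y_toy, np.full(4, y_toy.mean()))       # always predict the mean
print(r2_perfect, r2_mean)
```

Under this definition a model that always predicts the mean of the evaluated targets scores 0, which is the same comparison that `baseline_mse` makes in terms of squared error (there with the training-set mean).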