# Business Classification



Sample code for testing the **Logistic Regression** estimator from the *scikit-learn* library.

Import all the data from the *business.json* file and select:

- for the feature matrix: "stars" and "review_count"

- for the label column (*target*): "is_open".

Function to use: 

```python
model = linear_model.LogisticRegression()
model.fit(X, y)
```

In [1]:
import urllib.request # open and read URLs
import urllib.parse # parse URLs into components
import json # read the JSON file
from sklearn import linear_model

In [2]:
# Reading the JSON file (one JSON object per line)
# encoding is important: standard UTF-8
with open('business.json', 'r', encoding='utf8') as f:
    data = [json.loads(line) for line in f]
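
Because the file holds one JSON object per line, pandas can also load it in a single call. The following is a minimal sketch, not the approach used in this notebook; the file name `sample_business.json` and its two records are made up so that the snippet runs on its own.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records in JSON Lines format (one object per line),
# standing in for the real business.json.
records = [
    {"business_id": "a1", "stars": 3.0, "review_count": 5, "is_open": 0},
    {"business_id": "b2", "stars": 4.5, "review_count": 128, "is_open": 1},
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample_business.json")
with open(path, "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# lines=True tells pandas to parse each line as a separate JSON object.
sample_df = pd.read_json(path, lines=True)
print(sample_df[["stars", "review_count", "is_open"]])
```

The result is already a DataFrame, so the separate `pd.DataFrame(data, columns=[...])` step would reduce to a column selection.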

In [3]:
# Creating the data set
import pandas as pd

df = pd.DataFrame(data, columns=["business_id",
                                 "name",
                                 "address",
                                 "city",
                                 "state",
                                 "postal_code",
                                 "latitude",
                                 "longitude",
                                 "stars",
                                 "review_count",
                                 "is_open",
                                 "attributes"])

In [4]:
# The attributes column contains more features that won't be used
df.head(5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes
0,1SWheh84yJXfytovILXOAQ,Arizona Biltmore Golf Club,2818 E Camino Acequia Drive,Phoenix,AZ,85016,33.522143,-112.018481,3.0,5,0,{'GoodForKids': 'False'}
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,1,"{'RestaurantsReservations': 'True', 'GoodForMe..."
2,gnKjwL_1w79qoiV3IC_xQQ,Musashi Japanese Restaurant,"10110 Johnston Rd, Ste 15",Charlotte,NC,28210,35.092564,-80.859132,4.0,170,1,"{'GoodForKids': 'True', 'NoiseLevel': 'u'avera..."
3,xvX2CttrVhyG2z1dFg_0xw,Farmers Insurance - Paul Lorenz,"15655 W Roosevelt St, Ste 237",Goodyear,AZ,85338,33.455613,-112.395596,5.0,3,1,
4,HhyxOkGAM07SRYtlQ4wMFQ,Queen City Plumbing,"4209 Stuart Andrew Blvd, Ste F",Charlotte,NC,28217,35.190012,-80.887223,4.0,4,1,"{'BusinessAcceptsBitcoin': 'False', 'ByAppoint..."


In [19]:
# Get review_count and stars

X_df=df[['review_count', 'stars']]

y_df = df[['is_open']]

In [20]:
X_df.head(5)

Unnamed: 0,review_count,stars
0,5,3.0
1,128,2.5
2,170,4.0
3,3,5.0
4,4,4.0


In [21]:
y_df.head()

Unnamed: 0,is_open
0,0
1,1
2,1
3,1
4,1


**Adding the Offset value**

In [11]:
X_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   review_count  10000 non-null  int64  
 1   stars         10000 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 156.4 KB


In [14]:
ones = [1] * len(X_df)  # one entry per row (10000 rows here)

In [22]:
X_df.insert(0,'ones',ones,True)
X_df.head()

Unnamed: 0,ones,review_count,stars
0,1,5,3.0
1,1,128,2.5
2,1,170,4.0
3,1,3,5.0
4,1,4,4.0
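
Note that scikit-learn's `LogisticRegression` fits an intercept by default (`fit_intercept=True`), so an explicit offset column is not strictly required by that estimator. The sketch below uses made-up toy data (not the business data) to show where the learned offset is stored.

```python
import numpy as np
from sklearn import linear_model

# Made-up toy data: one feature, label is 1 when the feature is large.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# fit_intercept=True is the default; it is spelled out here for clarity.
toy_model = linear_model.LogisticRegression(fit_intercept=True)
toy_model.fit(X_toy, y_toy)

# The learned offset is stored separately from the feature weights.
print("intercept:", toy_model.intercept_)
print("weights:  ", toy_model.coef_)
```

When a column of ones is kept as a feature, the model ends up with two offset terms: the weight of that column and `intercept_`. Passing `fit_intercept=False` would be one way to avoid the duplication.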


In [24]:
# Take a look
# Look at the first 11 rows of X and y (.loc[:10] includes label 10)
print("Label: ", y_df.loc[:10,:], "\nFeatures:", X_df.loc[:10,:])

Label:      is_open
0         0
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         0
9         1
10        1 
Features:     ones  review_count  stars
0      1             5    3.0
1      1           128    2.5
2      1           170    4.0
3      1             3    5.0
4      1             4    4.0
5      1             3    2.5
6      1             7    3.5
7      1             3    3.5
8      1             8    5.0
9      1             8    4.5
10     1             5    2.0


### Training Data

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.33, random_state=23)
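
To make the effect of `test_size` concrete, the sketch below splits 100 made-up rows and prints the resulting sizes and class proportions. The `stratify` argument is an optional addition that this notebook does not use; it keeps the share of each class roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 80% of the labels are 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=23, stratify=y_demo
)

# Sizes of the two parts, then the share of 1s in each part.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```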

### Model

In [27]:
y_train = y_train.values.ravel() # Flatten the (n, 1) label column into a 1-D array
y_train

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [28]:
model = linear_model.LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

### Predictions

In [31]:
predictions = model.predict(X_test)

In [32]:
predictions

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [33]:
correctPredictions = (predictions == y_test.values.ravel())
correctPredictions

array([False,  True, False, ...,  True,  True,  True])

### Performance

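Metric (accuracy):

$$
    \text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]
$$

where $N$ is the number of test samples, $y_i$ is the true label and $\hat{y}_i$ is the predicted label of sample $i$.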
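
As a small worked example of this metric, the sketch below computes it by hand on five made-up labels and predictions (unrelated to the business data).

```python
import numpy as np

# Made-up labels and predictions for illustration.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

# Element-wise comparison gives a boolean array; True counts as 1.
correct = (y_pred == y_true)
accuracy = correct.sum() / len(correct)
print(accuracy)  # 3 of the 5 predictions match, so 0.6
```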

In [34]:
# True values are 1's, and False values are 0's

sum(correctPredictions) / len(correctPredictions)

0.8248484848484848

In [35]:
# scikit-learn score (mean accuracy on the test set)
model.score(X_test,y_test)

0.8248484848484848

In [36]:
# metrics.accuracy_score

from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

0.8248484848484848
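
Accuracy alone can be hard to interpret when one class dominates, and the prediction array printed above shows only 1s at both ends. One way to put a score in context, sketched here on made-up imbalanced labels rather than on the notebook's data, is to compare it with a majority-class baseline and to inspect the confusion matrix. `DummyClassifier` and `confusion_matrix` are scikit-learn utilities that the original notebook does not use.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 8 open businesses, 2 closed.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
X_dummy = np.zeros((10, 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, baseline_pred)
print(baseline_accuracy)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, baseline_pred)
print(cm)
```

If a trained model scores close to such a baseline, the confusion matrix shows whether it ever predicts the minority class.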