Here's a basic implementation of linear regression in Python using the 
scikit-learn library. 

The code runs on a CSV file, WineQT.csv.

The code loads the data into a pandas DataFrame, runs a few simple checks on it, splits it into training and testing sets, trains a linear regression model to predict the alcohol column from the remaining columns, and evaluates the model on the test data. The resulting score is printed to the console.

In [1]:
import pandas as pd


In [2]:
# Load the dataset
df = pd.read_csv('WineQT.csv')

In [3]:
# Perform simple exploratory data analysis
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,1
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,2
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,3
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1138,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,1592
1139,6.8,0.620,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6,1593
1140,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5,1594
1141,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,1595


In [4]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'Id'],
      dtype='object')

In [5]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
Id                      0
dtype: int64

In [6]:
df.nunique()

fixed acidity             91
volatile acidity         135
citric acid               77
residual sugar            80
chlorides                131
free sulfur dioxide       53
total sulfur dioxide     138
density                  388
pH                        87
sulphates                 89
alcohol                   61
quality                    6
Id                      1143
dtype: int64

In [7]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score


In [8]:
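# Features: every column except the target (this keeps the 'Id' row identifier)
X = df.drop(columns=['alcohol'])
# Target: alcohol content
Y = df['alcohol']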

In [10]:
# Split the data into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
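
Because train_test_split shuffles the rows, the split above (and therefore the score) can change from run to run. The following is a minimal sketch, on synthetic data rather than WineQT.csv, of how fixing random_state makes the split repeatable. The seed value 42, the array shapes, and the names X_demo, y_demo, split_a and split_b are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption: 100 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Same seed -> same shuffle -> identical train/test partitions
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Compare the four returned arrays (X_train, X_test, y_train, y_test) pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Train rows:", split_a[0].shape[0], "Test rows:", split_a[1].shape[0])
```

Passing the same random_state to the split in the notebook would make its reported score reproducible across runs.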


In [25]:
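# Fit the model on the training data
model = LinearRegression()
model = model.fit(X_train, y_train)

# Predict the response for the test data
y_pred = model.predict(X_test)

# score() returns the coefficient of determination (R^2), shown here as a percentage
a = model.score(X_test, y_test) * 100
print("Score:", round(a, 2), '%')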


Score: 67.92 %
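
The score printed above is the coefficient of determination (R^2) returned by LinearRegression.score, not a classification accuracy. The sketch below uses synthetic data (an assumption, since WineQT.csv is not bundled here) to show that score agrees with r2_score, to report the RMSE in the target's own units, and to predict for one new sample. The names X_demo, y_demo, demo_model and new_sample, and all the numeric values, are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_model = LinearRegression().fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_test)

# R^2 computed explicitly; demo_model.score(Xd_test, yd_test) gives the same value
r2 = r2_score(yd_test, yd_pred)
# Root mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(yd_test, yd_pred))
print("R^2:", round(r2, 3))
print("RMSE:", round(rmse, 3))

# Predict for a single new sample; the input must be 2-D (1 row, 3 features)
new_sample = np.array([[0.2, -1.0, 0.7]])
print("Prediction:", demo_model.predict(new_sample)[0])
```

Reporting an error metric such as RMSE alongside R^2 helps, because R^2 alone does not say how far off a typical prediction is.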




This is a basic implementation, and the code should be adapted to the requirements of the project, for example by using a different type of regression, handling missing data, or regularizing the model to prevent overfitting.
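
As one possible direction, the sketch below combines two such modifications: median imputation for missing values and L2 regularization with Ridge, evaluated by 5-fold cross-validation. It runs on synthetic data with artificially blanked entries. The imputation strategy, alpha=1.0, the fold count, and the names X_demo, y_demo and pipeline are assumptions for illustration, not tuned choices for the wine dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: a linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Blank out roughly 5% of the feature entries to mimic missing data
missing_mask = rng.random(X_demo.shape) < 0.05
X_demo[missing_mask] = np.nan

# Impute, scale, then fit a ridge regression (L2-regularized linear model)
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# 5-fold cross-validated R^2; each fold refits the whole pipeline
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("Mean R^2:", round(scores.mean(), 3))
```

Keeping the imputer and scaler inside the pipeline means they are fitted only on each fold's training portion, so no information from the held-out rows leaks into preprocessing.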