# **Prediction Model** of exit polls and opinion polls in India using a Linear Regression model

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#importing libraries
import numpy as np
import pandas as pd
import re # re module provides regular expression matching operations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler # MinMaxScaler rescales each feature to [0, 1] by subtracting its minimum and dividing by its range (maximum minus minimum).
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# LabelEncoder converts categorical labels to integer codes; OneHotEncoder converts categorical features to a one-hot (0/1 indicator) representation. Both are preprocessing steps that turn categorical input into numbers a model can use.

The code imports several libraries that are commonly used for machine learning tasks, such as:

*   numpy for numerical computations
*   pandas for data manipulation and analysis
*   TfidfVectorizer from sklearn.feature_extraction.text for converting text data into numerical data in the form of Tf-Idf features.
*   train_test_split from sklearn.model_selection for splitting the data into training and testing sets.
*   accuracy_score from sklearn.metrics for computing classification accuracy. It is imported here but not used: the regression model in this notebook is evaluated with r2_score and mean_squared_error instead.
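
The two encoders imported in the cell above behave differently, which a toy column makes visible. The sketch below is illustrative only; the `parties` array is a hypothetical example and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy column, not taken from the dataset.
parties = np.array(["INC", "BJP", "INC", "CPI"])

# LabelEncoder: one integer per category, assigned in sorted label order.
label_codes = LabelEncoder().fit_transform(parties)

# OneHotEncoder: one 0/1 column per category; it expects a 2-D input.
one_hot = OneHotEncoder().fit_transform(parties.reshape(-1, 1)).toarray()

print(label_codes)
print(one_hot)
```

Integer codes imply an ordering between categories that a linear model will treat as meaningful, whereas one-hot columns do not; which one fits better depends on the model and the number of categories.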

In [None]:
#importing dataset
df = pd.read_csv('/content/drive/MyDrive/Project-Exhibition-2/Dataset/Loksabha_1962-2019 .csv' , delimiter=',')
# delimiter is the character that separates the values in each row into columns; ',' is also the default.

In [None]:
# Preview the first five rows of the DataFrame
df.head()

Unnamed: 0,Pc_name,no,type,state,candidate_name,party,electors,votes,Turnout,margin,margin%,year
0,Adilabad,36,GEN,Andhra Pradesh,G. Narayan Reddy,Indian National Congress,404283,220383,54.5 %,89085,40.40%,1962.0
1,Adoni,27,GEN,Andhra Pradesh,Pendekanti Venkatasubbaiah,Indian National Congress,419077,252379,60.2 %,33022,13.10%,1962.0
2,Agra,433,GEN,Uttar Pradesh [1947 - 1999],Seth Achal Singh,Indian National Congress,433164,275663,63.6 %,54351,19.70%,1962.0
3,Ahmedabad,120,GEN,Gujarat,Indulal Kanaiyalal Yagnik,Nutan Maha Gujarat Janta Parisha,433392,270346,62.4 %,21592,8.00%,1962.0
4,Ahmednagar,245,GEN,Maharashtra,Motilal Kundanmal Firodya,Indian National Congress,403913,222091,55.0 %,14038,6.30%,1962.0


In [None]:
df['Pc_name'].describe()

count           8047
unique           895
top       Aurangabad
freq              30
Name: Pc_name, dtype: object

In [None]:
print(df)

             Pc_name   no type                        state  \
0           Adilabad   36  GEN               Andhra Pradesh   
1              Adoni   27  GEN               Andhra Pradesh   
2               Agra  433  GEN  Uttar Pradesh [1947 - 1999]   
3          Ahmedabad  120  GEN                      Gujarat   
4         Ahmednagar  245  GEN                  Maharashtra   
...              ...  ...  ...                          ...   
8042          Wardha    8  GEN                  Maharashtra   
8043         Wayanad    4  GEN                       Kerala   
8044      West Delhi    6  GEN         Delhi [1977 Onwards]   
8045  Yavatmal-Washi   14  GEN                  Maharashtra   
8046       Zahirabad    5  GEN                    Telangana   

                  candidate_name                             party   electors  \
0               G. Narayan Reddy          Indian National Congress   4,04,283   
1     Pendekanti Venkatasubbaiah          Indian National Congress   4,19,077   


In [None]:
#Encoding or feature encoding
# Map each constituency name to an integer code in order of first appearance.
dict_pc_name = {}
def My_Encoder(idx):
  # Reuse the stored code, or assign the next integer (starting at 1) to a new name.
  if idx not in dict_pc_name:
    dict_pc_name[idx] = len(dict_pc_name) + 1
  return dict_pc_name[idx]
df['New_Pc_name'] = df['Pc_name'].apply(My_Encoder)

The code starts by defining a dictionary dict_pc_name that maps each distinct 'Pc_name' value to an integer code.

The function My_Encoder takes an argument idx, which is a value from the 'Pc_name' column. If idx is already a key in dict_pc_name, the function returns the stored code. Otherwise it assigns the next unused integer, len(dict_pc_name) + 1, stores it under idx, and returns it. Codes therefore start at 1 and follow the order in which names first appear.

Finally, the apply method applies My_Encoder to each value in the 'Pc_name' column, and the results are stored in the new column 'New_Pc_name'.
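
The same first-appearance encoding can also be obtained with pandas' built-in `factorize`. This is an alternative sketch, not the notebook's method; the `names` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample of constituency names.
names = pd.Series(["Adilabad", "Adoni", "Agra", "Adoni", "Adilabad"])

# factorize returns 0-based codes in order of first appearance, plus the unique values.
codes, uniques = pd.factorize(names)
new_pc_name = codes + 1  # shift to 1-based codes, as My_Encoder produces

print(new_pc_name.tolist())
```

The `uniques` index doubles as a lookup table from code back to name, which the dictionary approach would need an extra step to provide.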

In [None]:
df1 = df.drop(['Pc_name','no','Turnout','margin','margin%','year'],axis=1)
# the columns Pc_name, no, Turnout, margin, margin%, and year are dropped from the original dataframe df.
# The parameter axis=1 specifies that the operation should be performed on columns.

In [None]:
df1.head()

Unnamed: 0,type,state,candidate_name,party,electors,votes,New_Pc_name
0,GEN,Andhra Pradesh,G. Narayan Reddy,Indian National Congress,404283,220383,1
1,GEN,Andhra Pradesh,Pendekanti Venkatasubbaiah,Indian National Congress,419077,252379,2
2,GEN,Uttar Pradesh [1947 - 1999],Seth Achal Singh,Indian National Congress,433164,275663,3
3,GEN,Gujarat,Indulal Kanaiyalal Yagnik,Nutan Maha Gujarat Janta Parisha,433392,270346,4
4,GEN,Maharashtra,Motilal Kundanmal Firodya,Indian National Congress,403913,222091,5


In [None]:
df1['state'].unique()

array(['Andhra Pradesh', 'Uttar Pradesh [1947 - 1999]', 'Gujarat',
       'Maharashtra', 'Rajasthan', 'Punjab', 'Kerala', 'Orissa', 'Madras',
       'West Bengal', 'Bihar [1947 - 1999]', 'Assam',
       'Madhya Pradesh [1947 - 1999]', 'Mysore', 'Himachal Pradesh',
       'Delhi', 'Manipur', 'Tripura', 'Haryana', 'Jammu & Kashmir',
       'Andaman & Nicobar Islands', 'Chandigarh', 'Dadra & Nagar Haveli',
       'Laccadive, Minicoy And Amindivi Islands', 'Goa, Daman And Diu',
       'Nagaland', 'Pondicherry', 'Tamil Nadu', 'Arunachal Pradesh',
       'Karnataka', 'Delhi [1977 Onwards]', 'Daman & Diu', 'Lakshadweep',
       'Mizoram', 'Meghalaya', 'Sikkim', 'Winning Candidate', 'Goa',
       'Uttar Pradesh [2000 Onwards]', 'Uttarakhand',
       'Bihar [2000 Onwards]', 'Madhya Pradesh [2000 Onwards]',
       'Chhattisgarh', 'Jharkhand', 'Telangana',
       'Andhra Pradesh [2014 Onwards]'], dtype=object)

In [None]:
#Data cleaning and preprocessing using regular expressions and label encoding
# Matches runs of characters outside square brackets. For values like
# 'Uttar Pradesh [1947 - 1999]', the first match is the text before the bracket.
regex = r'(?<!\[)[^\[\]]+(?!\])'
def Changer(str1):
  matches = re.findall(regex, str1)
  # Keep the first match and trim surrounding whitespace.
  return matches[0].strip()

df1['state'] = df['state'].apply(Changer)
encoder = LabelEncoder()
df1['New_State'] = encoder.fit_transform(df1['state'])
df1['New_type'] = encoder.fit_transform(df1['type'])
df1['New_candidate_name'] = encoder.fit_transform(df1['candidate_name'])
df1['New_party'] = encoder.fit_transform(df1['party'])

*   The Changer function uses a regular expression to strip the bracketed period suffix (for example, "[1947 - 1999]") from each state value, so that "Uttar Pradesh [1947 - 1999]" and "Uttar Pradesh [2000 Onwards]" both become "Uttar Pradesh".
*   The apply method applies Changer to every value in the state column of the dataframe.
*   The LabelEncoder from scikit-learn encodes the categorical variables state, type, candidate_name, and party as integers. Because the same encoder object is refit for each column, only the last mapping (party) remains available on it afterwards.
*   The encoded variables are added to the df1 dataframe under names prefixed with "New_" to mark them as the encoded versions.
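
The cleaning and encoding steps listed above can be sketched on a few rows. This is a minimal illustration under stated assumptions: the `toy` frame is hypothetical, the regular expression is a simpler substitute for the one in the notebook, and keeping one encoder per column in an `encoders` dictionary is a design choice added here so that each mapping can be inverted later.

```python
import re
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; the state strings follow the format seen in the dataset.
toy = pd.DataFrame({
    "state": ["Bihar [1947 - 1999]", "Bihar [2000 Onwards]", "Assam"],
    "party": ["INC", "JD(U)", "INC"],
})

# Drop a trailing bracketed period and any surrounding whitespace.
toy["state"] = toy["state"].apply(lambda s: re.sub(r"\s*\[.*?\]\s*$", "", s).strip())

# One encoder per column, kept in a dict so each mapping can be inverted later.
encoders = {}
for col in ["state", "party"]:
    encoders[col] = LabelEncoder()
    toy["New_" + col] = encoders[col].fit_transform(toy[col])

print(toy)
print(encoders["state"].inverse_transform([0, 1]))
```

With a separate encoder per column, a predicted or stored code can be translated back to its label with `inverse_transform`.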

In [None]:
df1.head()

Unnamed: 0,type,state,candidate_name,party,electors,votes,New_Pc_name,New_State,New_type,New_candidate_name,New_party
0,GEN,Andhra Pradesh,G. Narayan Reddy,Indian National Congress,404283,220383,1,1,0,1667,59
1,GEN,Andhra Pradesh,Pendekanti Venkatasubbaiah,Indian National Congress,419077,252379,2,1,0,4036,59
2,GEN,Uttar Pradesh,Seth Achal Singh,Indian National Congress,433164,275663,3,37,0,5181,59
3,GEN,Gujarat,Indulal Kanaiyalal Yagnik,Nutan Maha Gujarat Janta Parisha,433392,270346,4,12,0,2127,106
4,GEN,Maharashtra,Motilal Kundanmal Firodya,Indian National Congress,403913,222091,5,23,0,3462,59


In [None]:
df2 = df1.drop(['state','type','candidate_name','party'],axis=1)
# the columns 'state', 'type', 'candidate_name', and 'party' are dropped from the dataframe df1.

In [None]:
df2.head()

Unnamed: 0,electors,votes,New_Pc_name,New_State,New_type,New_candidate_name,New_party
0,404283,220383,1,1,0,1667,59
1,419077,252379,2,1,0,4036,59
2,433164,275663,3,37,0,5181,59
3,433392,270346,4,12,0,2127,106
4,403913,222091,5,23,0,3462,59


In [None]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8047 entries, 0 to 8046
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   electors            8047 non-null   object
 1   votes               8047 non-null   object
 2   New_Pc_name         8047 non-null   int64 
 3   New_State           8047 non-null   int64 
 4   New_type            8047 non-null   int64 
 5   New_candidate_name  8047 non-null   int64 
 6   New_party           8047 non-null   int64 
dtypes: int64(5), object(2)
memory usage: 440.2+ KB


In [None]:
# Conversion
def Converting(i):
  # Strip the comma separators, then parse; values that cannot be parsed become 0.
  i = i.replace(",","")
  try:
    i = float(i)
  except ValueError:
    i = 0
  return i

df2['votes_new'] = df2['votes'].apply(lambda i: Converting(i))
df2['electors_new'] = df2['electors'].apply(lambda i: Converting(i))

This code defines a function Converting, which takes a string argument i. The function first removes any commas from i, then attempts to convert the result to a floating-point number with float(). If the conversion fails, the function returns 0. Because unparseable values become 0 rather than NaN, they will not be reported by a later isna() check.
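
If unparseable entries should stay visible as missing values instead of becoming 0, pandas' `to_numeric` with `errors="coerce"` is one option. This is an alternative sketch, not the notebook's method; the `raw` series is hypothetical.

```python
import pandas as pd

# Hypothetical values: Indian-style digit grouping plus one unparseable entry.
raw = pd.Series(["4,04,283", "2,20,383", "votes"])

# Remove commas, then coerce anything non-numeric to NaN instead of 0.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned.tolist())
print(cleaned.isna().sum())
```

Rows flagged as NaN can then be inspected or dropped explicitly before training.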

In [None]:
df2.head()

Unnamed: 0,electors,votes,New_Pc_name,New_State,New_type,New_candidate_name,New_party,votes_new,electors_new
0,404283,220383,1,1,0,1667,59,220383.0,404283.0
1,419077,252379,2,1,0,4036,59,252379.0,419077.0
2,433164,275663,3,37,0,5181,59,275663.0,433164.0
3,433392,270346,4,12,0,2127,106,270346.0,433392.0
4,403913,222091,5,23,0,3462,59,222091.0,403913.0


In [None]:
df3 = df2.drop(['electors','votes'],axis=1)
# dropping electors and votes from df2.

In [None]:
df3.head()

Unnamed: 0,New_Pc_name,New_State,New_type,New_candidate_name,New_party,votes_new,electors_new
0,1,1,0,1667,59,220383.0,404283.0
1,2,1,0,4036,59,252379.0,419077.0
2,3,37,0,5181,59,275663.0,433164.0
3,4,12,0,2127,106,270346.0,433392.0
4,5,23,0,3462,59,222091.0,403913.0


In [None]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8047 entries, 0 to 8046
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   New_Pc_name         8047 non-null   int64  
 1   New_State           8047 non-null   int64  
 2   New_type            8047 non-null   int64  
 3   New_candidate_name  8047 non-null   int64  
 4   New_party           8047 non-null   int64  
 5   votes_new           8047 non-null   float64
 6   electors_new        8047 non-null   float64
dtypes: float64(2), int64(5)
memory usage: 440.2 KB


In [None]:
df3.isna().sum()
# isna() function is used to detect missing values (NaN) in a DataFrame.

New_Pc_name           0
New_State             0
New_type              0
New_candidate_name    0
New_party             0
votes_new             0
electors_new          0
dtype: int64

In [None]:
from sklearn.preprocessing import MinMaxScaler

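# create a MinMaxScaler object
scaler = MinMaxScaler()

# Each fit_transform call refits the scaler on the given column, so after
# these two lines it holds only the min and max of votes_new.
df3['electors_new'] = scaler.fit_transform(df3[['electors_new']])
df3['votes_new'] = scaler.fit_transform(df3[['votes_new']])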
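
A possible variation on the scaling cell is to keep one scaler per column, so that scaled values (for example, model predictions) can be mapped back to vote counts. The sketch below assumes hypothetical `votes` and `electors` arrays, and the scaler names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical vote and elector counts.
votes = np.array([[220383.0], [252379.0], [275663.0]])
electors = np.array([[404283.0], [419077.0], [433164.0]])

# A dedicated scaler per column keeps each column's min and max available.
votes_scaler = MinMaxScaler()
electors_scaler = MinMaxScaler()
votes_scaled = votes_scaler.fit_transform(votes)
electors_scaled = electors_scaler.fit_transform(electors)

# MinMaxScaler computes (x - min) / (max - min) per feature.
manual = (votes - votes.min()) / (votes.max() - votes.min())

# inverse_transform maps scaled values back to the original units.
restored = votes_scaler.inverse_transform(votes_scaled)
print(votes_scaled.ravel(), restored.ravel())
```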

In [None]:
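# The call parentheses are missing here, so the output below is the bound method's repr (the whole frame); df3.head() would show only the first five rows.
df3.head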

<bound method NDFrame.head of       New_Pc_name  New_State  New_type  New_candidate_name  New_party  \
0               1          1         0                1667         59   
1               2          1         0                4036         59   
2               3         37         0                5181         59   
3               4         12         0                2127        106   
4               5         23         0                3462         59   
...           ...        ...       ...                 ...        ...   
8042          489         23         0                4632         30   
8043          863         18         0                4344         59   
8044          864          9         0                5056         30   
8045          895         23         0                 744        127   
8046          866         35         0                 519        137   

      votes_new  electors_new  
0      0.124951      0.120022  
1      0.143092      0.124414

In [None]:
X = df3.drop(['votes_new'],axis=1)
Y = df3['votes_new']

In [None]:
# Hold out 20% of the rows for testing. No random_state is set, so the split (and the scores below) change from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
print("The Shape of X_train:- " + str(X_train.shape))
print("The Shape of Y_train:- " + str(y_train.shape))
print("The Shape of X_test:- " + str(X_test.shape))
print("The Shape of Y_test:- " + str(y_test.shape))

The Shape of X_train:- (6437, 6)
The Shape of Y_train:- (6437,)
The Shape of X_test:- (1610, 6)
The Shape of Y_test:- (1610,)
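
To make a split repeatable, `train_test_split` accepts a `random_state` seed. The sketch below uses hypothetical arrays `X_demo` and `y_demo`, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state gives the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(split_a[0].shape, split_a[1].shape)
```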


In [None]:
# Model Training
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# fit() estimates the coefficients of the linear model that best fits the training data, using the Ordinary Least Squares (OLS) method.
regressor.fit(X_train, y_train)

LinearRegression()
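
As a rough check of what the fit step computes, the ordinary least squares solution can be reproduced with NumPy on synthetic data. The data below are hypothetical and noise-free, generated from known coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Least squares with an explicit column of ones for the intercept.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

Both approaches should recover the intercept and the two slopes used to generate the data, up to floating-point error.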

In [None]:
y_pred= regressor.predict(X_test)

In [None]:
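# Evaluating the performance of a linear regression model.
# Both metrics are computed on the min-max-scaled vote counts, so the MSE is in scaled units, not votes.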
from sklearn.metrics import r2_score, mean_squared_error
print("R-squared:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

R-squared: 0.8214867896755529
Mean Squared Error: 0.005069151499417894
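
Both metrics can be computed by hand, which makes their definitions explicit. The values in `y_true` and `y_hat` below are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values.
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_hat = np.array([0.12, 0.18, 0.33, 0.37])

# MSE: mean of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```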


In [None]:
df3.head()

Unnamed: 0,New_Pc_name,New_State,New_type,New_candidate_name,New_party,votes_new,electors_new
0,1,1,0,1667,59,0.124951,0.120022
1,2,1,0,4036,59,0.143092,0.124414
2,3,37,0,5181,59,0.156293,0.128596
3,4,12,0,2127,106,0.153278,0.128664
4,5,23,0,3462,59,0.125919,0.119912


In [None]:
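# Features must follow the column order of X: New_Pc_name, New_State, New_type, New_candidate_name, New_party, electors_new (scaled).
regressor.predict([[2,1,0,4036,59,0.124414]])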

array([0.13148427])
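
When a model is fitted on a DataFrame, passing the query as a DataFrame with the same column names makes the feature order explicit. The sketch below uses a hypothetical two-column training frame; only the column names are borrowed from the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training frame with two of the notebook's feature names.
train = pd.DataFrame({
    "New_party": [59, 59, 106, 30],
    "electors_new": [0.12, 0.13, 0.14, 0.20],
})
target = pd.Series([0.12, 0.14, 0.15, 0.22])
model = LinearRegression().fit(train, target)

# Build the query row with the same column names and order used for fitting.
query = pd.DataFrame([[59, 0.124]], columns=train.columns)
prediction = model.predict(query)
print(prediction)
```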

The trained LinearRegression model is saved to disk as a binary file using the pickle module.

In [None]:
import pickle

In [None]:
filename = 'PredExitPoll.pkl'
# Use a context manager so the file is closed after writing.
with open(filename, 'wb') as f:
  pickle.dump(regressor, f)

In [None]:
# Reload the model from disk and confirm it can still predict.
with open(filename, 'rb') as f:
  loaded_model = pickle.load(f)
loaded_model.predict(X_test)

array([0.3519084 , 0.53127141, 0.3239722 , ..., 0.49500156, 0.3346348 ,
       0.15076586])
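
A quick way to confirm that a save and load round trip preserved the model is to compare predictions before and after. This sketch uses a small hypothetical model and a temporary file; the name `demo_model.pkl` is illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the trained regressor.
X_demo = np.array([[0.0], [1.0], [2.0]])
model = LinearRegression().fit(X_demo, np.array([1.0, 3.0, 5.0]))

path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")  # hypothetical file name

# Context managers close the file handles even if an error occurs.
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled scikit-learn models are generally expected to be loaded with the same library version that saved them, and pickle files should only be loaded from trusted sources.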